# Check the environment
# ==============================================================================
import sys
print(sys.version)
print(sys.path)
print("---")
print(sys.executable)
3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 08:28:27) [Clang 14.0.6 ]
['/opt/anaconda3/envs/practica1b/lib/python312.zip', '/opt/anaconda3/envs/practica1b/lib/python3.12', '/opt/anaconda3/envs/practica1b/lib/python3.12/lib-dynload', '', '/Users/oscar/.local/lib/python3.12/site-packages', '/opt/anaconda3/envs/practica1b/lib/python3.12/site-packages', '/opt/anaconda3/envs/practica1b/lib/python3.12/site-packages/setuptools/_vendor']
---
/opt/anaconda3/envs/practica1b/bin/python
The goal of this notebook is to analyze and preprocess the numerical and categorical variables, adjusting data types and splitting the dataset into train and test sets.
For the numerical variables, we explore their distributions, analyze correlations, and handle outliers and missing values. For the categorical variables, we impute missing values and use measures such as Cramér's V and Weight of Evidence (WOE) to understand the relationships between them.
The libraries used in this notebook are:
# Data handling
# ==============================================================================
import pandas as pd
import numpy as np
# Plots
# ==============================================================================
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
from plotnine import *
# Other
# ==============================================================================
from sklearn.impute import KNNImputer
import scipy.stats as ss
# pandas configuration
# ==============================================================================
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')
We import the helper functions needed for this notebook.
import sys
sys.path.append("../scr/")
import funciones
sys.path.remove('../scr/')
# Reload the module
# ==============================================================================
import importlib
sys.path.append("../scr/")
importlib.reload(funciones)
sys.path.remove('../scr/')
We load the original data, since I have not yet modified it in any way.
path_folder = "../data/raw/application_data.csv"
pd_loan = pd.read_csv(path_folder).set_index("SK_ID_CURR")
As a reminder, these are the dimensions and data types we are working with:
# Dataset dimensions
# ==============================================================================
pd_loan.shape
(307511, 121)
# Type of each column
# ==============================================================================
pd_loan.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Index: 307511 entries, 100002 to 456255
Data columns (total 121 columns):
#  Column  Dtype
---  ------  -----
0  TARGET  int64
1  NAME_CONTRACT_TYPE  object
2  CODE_GENDER  object
3  FLAG_OWN_CAR  object
4  FLAG_OWN_REALTY  object
5  CNT_CHILDREN  int64
6  AMT_INCOME_TOTAL  float64
7  AMT_CREDIT  float64
8  AMT_ANNUITY  float64
9  AMT_GOODS_PRICE  float64
10  NAME_TYPE_SUITE  object
11  NAME_INCOME_TYPE  object
12  NAME_EDUCATION_TYPE  object
13  NAME_FAMILY_STATUS  object
14  NAME_HOUSING_TYPE  object
15  REGION_POPULATION_RELATIVE  float64
16  DAYS_BIRTH  int64
17  DAYS_EMPLOYED  int64
18  DAYS_REGISTRATION  float64
19  DAYS_ID_PUBLISH  int64
20  OWN_CAR_AGE  float64
21  FLAG_MOBIL  int64
22  FLAG_EMP_PHONE  int64
23  FLAG_WORK_PHONE  int64
24  FLAG_CONT_MOBILE  int64
25  FLAG_PHONE  int64
26  FLAG_EMAIL  int64
27  OCCUPATION_TYPE  object
28  CNT_FAM_MEMBERS  float64
29  REGION_RATING_CLIENT  int64
30  REGION_RATING_CLIENT_W_CITY  int64
31  WEEKDAY_APPR_PROCESS_START  object
32  HOUR_APPR_PROCESS_START  int64
33  REG_REGION_NOT_LIVE_REGION  int64
34  REG_REGION_NOT_WORK_REGION  int64
35  LIVE_REGION_NOT_WORK_REGION  int64
36  REG_CITY_NOT_LIVE_CITY  int64
37  REG_CITY_NOT_WORK_CITY  int64
38  LIVE_CITY_NOT_WORK_CITY  int64
39  ORGANIZATION_TYPE  object
40  EXT_SOURCE_1  float64
41  EXT_SOURCE_2  float64
42  EXT_SOURCE_3  float64
43  APARTMENTS_AVG  float64
44  BASEMENTAREA_AVG  float64
45  YEARS_BEGINEXPLUATATION_AVG  float64
46  YEARS_BUILD_AVG  float64
47  COMMONAREA_AVG  float64
48  ELEVATORS_AVG  float64
49  ENTRANCES_AVG  float64
50  FLOORSMAX_AVG  float64
51  FLOORSMIN_AVG  float64
52  LANDAREA_AVG  float64
53  LIVINGAPARTMENTS_AVG  float64
54  LIVINGAREA_AVG  float64
55  NONLIVINGAPARTMENTS_AVG  float64
56  NONLIVINGAREA_AVG  float64
57  APARTMENTS_MODE  float64
58  BASEMENTAREA_MODE  float64
59  YEARS_BEGINEXPLUATATION_MODE  float64
60  YEARS_BUILD_MODE  float64
61  COMMONAREA_MODE  float64
62  ELEVATORS_MODE  float64
63  ENTRANCES_MODE  float64
64  FLOORSMAX_MODE  float64
65  FLOORSMIN_MODE  float64
66  LANDAREA_MODE  float64
67  LIVINGAPARTMENTS_MODE  float64
68  LIVINGAREA_MODE  float64
69  NONLIVINGAPARTMENTS_MODE  float64
70  NONLIVINGAREA_MODE  float64
71  APARTMENTS_MEDI  float64
72  BASEMENTAREA_MEDI  float64
73  YEARS_BEGINEXPLUATATION_MEDI  float64
74  YEARS_BUILD_MEDI  float64
75  COMMONAREA_MEDI  float64
76  ELEVATORS_MEDI  float64
77  ENTRANCES_MEDI  float64
78  FLOORSMAX_MEDI  float64
79  FLOORSMIN_MEDI  float64
80  LANDAREA_MEDI  float64
81  LIVINGAPARTMENTS_MEDI  float64
82  LIVINGAREA_MEDI  float64
83  NONLIVINGAPARTMENTS_MEDI  float64
84  NONLIVINGAREA_MEDI  float64
85  FONDKAPREMONT_MODE  object
86  HOUSETYPE_MODE  object
87  TOTALAREA_MODE  float64
88  WALLSMATERIAL_MODE  object
89  EMERGENCYSTATE_MODE  object
90  OBS_30_CNT_SOCIAL_CIRCLE  float64
91  DEF_30_CNT_SOCIAL_CIRCLE  float64
92  OBS_60_CNT_SOCIAL_CIRCLE  float64
93  DEF_60_CNT_SOCIAL_CIRCLE  float64
94  DAYS_LAST_PHONE_CHANGE  float64
95  FLAG_DOCUMENT_2  int64
96  FLAG_DOCUMENT_3  int64
97  FLAG_DOCUMENT_4  int64
98  FLAG_DOCUMENT_5  int64
99  FLAG_DOCUMENT_6  int64
100  FLAG_DOCUMENT_7  int64
101  FLAG_DOCUMENT_8  int64
102  FLAG_DOCUMENT_9  int64
103  FLAG_DOCUMENT_10  int64
104  FLAG_DOCUMENT_11  int64
105  FLAG_DOCUMENT_12  int64
106  FLAG_DOCUMENT_13  int64
107  FLAG_DOCUMENT_14  int64
108  FLAG_DOCUMENT_15  int64
109  FLAG_DOCUMENT_16  int64
110  FLAG_DOCUMENT_17  int64
111  FLAG_DOCUMENT_18  int64
112  FLAG_DOCUMENT_19  int64
113  FLAG_DOCUMENT_20  int64
114  FLAG_DOCUMENT_21  int64
115  AMT_REQ_CREDIT_BUREAU_HOUR  float64
116  AMT_REQ_CREDIT_BUREAU_DAY  float64
117  AMT_REQ_CREDIT_BUREAU_WEEK  float64
118  AMT_REQ_CREDIT_BUREAU_MON  float64
119  AMT_REQ_CREDIT_BUREAU_QRT  float64
120  AMT_REQ_CREDIT_BUREAU_YEAR  float64
dtypes: float64(65), int64(40), object(16)
memory usage: 286.2+ MB
# Name of each column
# ==============================================================================
pd_loan.columns
Index(['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
'AMT_ANNUITY', 'AMT_GOODS_PRICE',
...
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object', length=121)
# Number of columns of each data type
# ==============================================================================
pd_loan.dtypes.sort_values().to_frame('feature_type').groupby(by = 'feature_type').size().to_frame('count').reset_index()
| | feature_type | count |
|---|---|---|
| 0 | int64 | 40 |
| 1 | float64 | 65 |
| 2 | object | 16 |
We identify the variables in the dataset and classify them as categorical or continuous, making sure each one has the data type that matches what it represents. I then cast the columns accordingly so they can be interpreted and manipulated correctly in the later stages of the analysis.
# Identify the categorical variables
# ==============================================================================
list_var_cat, other = funciones.dame_variables_categoricas(dataset=pd_loan)
pd_loan[list_var_cat] = pd_loan[list_var_cat].astype("category")
# Select the columns that hold continuous numeric data
# ==============================================================================
list_var_continuous = list(pd_loan.select_dtypes(['float', 'int']).columns)
pd_loan[list_var_continuous] = pd_loan[list_var_continuous].astype(float)
Note that some variables initially typed as integers are actually boolean, which makes them categorical. Other integer variables, however, hold continuous values. For this reason, the variables stored in the list other are most likely those that, although stored as integers, are genuinely continuous.
other
['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START']
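The source of funciones.dame_variables_categoricas is not shown in this notebook, so the following is only a minimal sketch of one way such a split could work. It assumes that text columns and integer columns with few distinct values are treated as categorical, and that the remaining integer columns are returned separately. The function name split_categorical_candidates and the max_unique threshold are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def split_categorical_candidates(df: pd.DataFrame, max_unique: int = 20):
    """Hypothetical stand-in for the notebook's helper, not its real code.

    Returns (categorical, other): text columns and low-cardinality integer
    columns are categorical; high-cardinality integer columns go to `other`.
    Float columns are left out of both lists.
    """
    categorical, other = [], []
    for col in df.columns:
        is_int = ptypes.is_integer_dtype(df[col])
        is_text = ptypes.is_object_dtype(df[col]) or ptypes.is_string_dtype(df[col])
        if is_text or (is_int and df[col].nunique() <= max_unique):
            categorical.append(col)
        elif is_int:
            other.append(col)
    return categorical, other


# Toy data, invented for illustration only
demo_types = pd.DataFrame({
    "FLAG": [0, 1, 0, 1],
    "CITY": ["a", "b", "a", "c"],
    "DAYS": [-100, -2500, -31, -977],
    "AMOUNT": [1.5, 2.0, 3.25, 4.0],
})
demo_cats, demo_other = split_categorical_candidates(demo_types, max_unique=3)
print(demo_cats, demo_other)
```

Whatever rule the real helper applies, the lists it returns are worth reviewing by hand, since a cardinality threshold cannot tell an ordinal count from a continuous measurement.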
It is worth noting that the variables representing time differences include both negative and positive values, since they are defined relative to the time of the loan application. They will be examined in more detail later.
# New type of each column
# ==============================================================================
pd_loan.dtypes
TARGET  category
NAME_CONTRACT_TYPE  category
CODE_GENDER  category
FLAG_OWN_CAR  category
FLAG_OWN_REALTY  category
CNT_CHILDREN  category
AMT_INCOME_TOTAL  float64
AMT_CREDIT  float64
AMT_ANNUITY  float64
AMT_GOODS_PRICE  float64
NAME_TYPE_SUITE  category
NAME_INCOME_TYPE  category
NAME_EDUCATION_TYPE  category
NAME_FAMILY_STATUS  category
NAME_HOUSING_TYPE  category
REGION_POPULATION_RELATIVE  float64
DAYS_BIRTH  float64
DAYS_EMPLOYED  float64
DAYS_REGISTRATION  float64
DAYS_ID_PUBLISH  float64
OWN_CAR_AGE  float64
FLAG_MOBIL  category
FLAG_EMP_PHONE  category
FLAG_WORK_PHONE  category
FLAG_CONT_MOBILE  category
FLAG_PHONE  category
FLAG_EMAIL  category
OCCUPATION_TYPE  category
CNT_FAM_MEMBERS  float64
REGION_RATING_CLIENT  category
REGION_RATING_CLIENT_W_CITY  category
WEEKDAY_APPR_PROCESS_START  category
HOUR_APPR_PROCESS_START  float64
REG_REGION_NOT_LIVE_REGION  category
REG_REGION_NOT_WORK_REGION  category
LIVE_REGION_NOT_WORK_REGION  category
REG_CITY_NOT_LIVE_CITY  category
REG_CITY_NOT_WORK_CITY  category
LIVE_CITY_NOT_WORK_CITY  category
ORGANIZATION_TYPE  category
EXT_SOURCE_1  float64
EXT_SOURCE_2  float64
EXT_SOURCE_3  float64
APARTMENTS_AVG  float64
BASEMENTAREA_AVG  float64
YEARS_BEGINEXPLUATATION_AVG  float64
YEARS_BUILD_AVG  float64
COMMONAREA_AVG  float64
ELEVATORS_AVG  float64
ENTRANCES_AVG  float64
FLOORSMAX_AVG  float64
FLOORSMIN_AVG  float64
LANDAREA_AVG  float64
LIVINGAPARTMENTS_AVG  float64
LIVINGAREA_AVG  float64
NONLIVINGAPARTMENTS_AVG  float64
NONLIVINGAREA_AVG  float64
APARTMENTS_MODE  float64
BASEMENTAREA_MODE  float64
YEARS_BEGINEXPLUATATION_MODE  float64
YEARS_BUILD_MODE  float64
COMMONAREA_MODE  float64
ELEVATORS_MODE  float64
ENTRANCES_MODE  float64
FLOORSMAX_MODE  float64
FLOORSMIN_MODE  float64
LANDAREA_MODE  float64
LIVINGAPARTMENTS_MODE  float64
LIVINGAREA_MODE  float64
NONLIVINGAPARTMENTS_MODE  float64
NONLIVINGAREA_MODE  float64
APARTMENTS_MEDI  float64
BASEMENTAREA_MEDI  float64
YEARS_BEGINEXPLUATATION_MEDI  float64
YEARS_BUILD_MEDI  float64
COMMONAREA_MEDI  float64
ELEVATORS_MEDI  float64
ENTRANCES_MEDI  float64
FLOORSMAX_MEDI  float64
FLOORSMIN_MEDI  float64
LANDAREA_MEDI  float64
LIVINGAPARTMENTS_MEDI  float64
LIVINGAREA_MEDI  float64
NONLIVINGAPARTMENTS_MEDI  float64
NONLIVINGAREA_MEDI  float64
FONDKAPREMONT_MODE  category
HOUSETYPE_MODE  category
TOTALAREA_MODE  float64
WALLSMATERIAL_MODE  category
EMERGENCYSTATE_MODE  category
OBS_30_CNT_SOCIAL_CIRCLE  float64
DEF_30_CNT_SOCIAL_CIRCLE  float64
OBS_60_CNT_SOCIAL_CIRCLE  float64
DEF_60_CNT_SOCIAL_CIRCLE  float64
DAYS_LAST_PHONE_CHANGE  float64
FLAG_DOCUMENT_2  category
FLAG_DOCUMENT_3  category
FLAG_DOCUMENT_4  category
FLAG_DOCUMENT_5  category
FLAG_DOCUMENT_6  category
FLAG_DOCUMENT_7  category
FLAG_DOCUMENT_8  category
FLAG_DOCUMENT_9  category
FLAG_DOCUMENT_10  category
FLAG_DOCUMENT_11  category
FLAG_DOCUMENT_12  category
FLAG_DOCUMENT_13  category
FLAG_DOCUMENT_14  category
FLAG_DOCUMENT_15  category
FLAG_DOCUMENT_16  category
FLAG_DOCUMENT_17  category
FLAG_DOCUMENT_18  category
FLAG_DOCUMENT_19  category
FLAG_DOCUMENT_20  category
FLAG_DOCUMENT_21  category
AMT_REQ_CREDIT_BUREAU_HOUR  float64
AMT_REQ_CREDIT_BUREAU_DAY  float64
AMT_REQ_CREDIT_BUREAU_WEEK  float64
AMT_REQ_CREDIT_BUREAU_MON  float64
AMT_REQ_CREDIT_BUREAU_QRT  float64
AMT_REQ_CREDIT_BUREAU_YEAR  float64
dtype: object
Before training the model, we perform a stratified split of the data into training and test sets, so that the class distribution of the target variable stays proportional in both.
pd_plot_target = pd_loan['TARGET']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_target_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_target_pc = pd.merge(pd_plot_target,
pd_plot_target_conteo, on=['TARGET'], how='inner')
# Plot the distribution of the target variable values
# ==============================================================================
pd_plot_target_pc['TARGET'] = pd_plot_target_pc['TARGET'].astype(str)
(ggplot(pd_plot_target_pc, aes(x='TARGET', y='percent', fill='TARGET'))
    + geom_bar(stat='identity', show_legend=False)
    + labs(
        x='Target variable',
        y='Percentage',
        title='Distribution of the target variable'
    )
    + scale_fill_manual(values=['navy', 'orange'])
    + theme_minimal()
    + theme(
        plot_title=element_text(size=14, weight='bold', ha='center'),
        axis_title=element_text(size=12),
        axis_text=element_text(size=10),
    )
)
As noted in the previous notebook, the sample is imbalanced: most clients have no difficulty repaying the loan.
20% of the data is assigned to the test set and 80% to the training set, so that the model is trained on most of the data and evaluated on a representative sample that was not used in training.
from sklearn.model_selection import train_test_split
X_pd_loan, X_pd_loan_test, y_pd_loan, y_pd_loan_test = train_test_split(
    pd_loan.drop('TARGET', axis=1),
    pd_loan['TARGET'],
    stratify=pd_loan['TARGET'],
    test_size=0.2)
pd_loan_train = pd.concat([X_pd_loan, y_pd_loan],axis=1)
pd_loan_test = pd.concat([X_pd_loan_test, y_pd_loan_test],axis=1)
print('== Train\n', pd_loan_train['TARGET'].value_counts(normalize=True))
print('== Test\n', pd_loan_test['TARGET'].value_counts(normalize=True))
== Train
TARGET
0    0.919271
1    0.080729
Name: proportion, dtype: float64
== Test
TARGET
0    0.919272
1    0.080728
Name: proportion, dtype: float64
The output shows that the class proportions are practically identical in the training and test sets, which confirms that the stratified split preserved the class proportions in both.
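The split above does not fix a random seed, so the exact rows assigned to each set can change between runs. The following sketch, on synthetic data, shows how passing random_state to train_test_split makes a stratified split repeatable. The seed value 42 and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (920 zeros, 80 ones); invented for illustration
rng = np.random.default_rng(0)
demo_split = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "TARGET": np.r_[np.zeros(920, dtype=int), np.ones(80, dtype=int)],
})


def stratified_split(df: pd.DataFrame, seed: int):
    """80/20 split that preserves the TARGET proportions and is repeatable."""
    return train_test_split(
        df, test_size=0.2, stratify=df["TARGET"], random_state=seed
    )


train_a, test_a = stratified_split(demo_split, seed=42)
train_b, test_b = stratified_split(demo_split, seed=42)

# Same seed, same rows in each set
print(train_a.index.equals(train_b.index))
# Positive-class share in train and test
print(train_a["TARGET"].mean(), test_a["TARGET"].mean())
```

Fixing the seed would also keep downstream figures, such as the null counts computed on the training set, stable from one execution to the next.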
This analysis shows the number of null values per row and per column in the training set, which helps identify the variables or records with missing data.
# Count nulls per column
# ==============================================================================
pd_series_null_columns = pd_loan_train.isnull().sum().sort_values(ascending=False)
# Count nulls per row
# ==============================================================================
pd_series_null_rows = pd_loan_train.isnull().sum(axis=1).sort_values(ascending=False)
# Dimensions
# ==============================================================================
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = pd_loan_train['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_loan_train.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_loan_train.shape[1]
(121,) (246008,)
pd_null_columnas
| | nulos_columnas | porcentaje_columnas |
|---|---|---|
| COMMONAREA_MODE | 171765 | 0.698209 |
| COMMONAREA_MEDI | 171765 | 0.698209 |
| COMMONAREA_AVG | 171765 | 0.698209 |
| NONLIVINGAPARTMENTS_MEDI | 170752 | 0.694091 |
| NONLIVINGAPARTMENTS_MODE | 170752 | 0.694091 |
| NONLIVINGAPARTMENTS_AVG | 170752 | 0.694091 |
| LIVINGAPARTMENTS_AVG | 168103 | 0.683323 |
| LIVINGAPARTMENTS_MEDI | 168103 | 0.683323 |
| LIVINGAPARTMENTS_MODE | 168103 | 0.683323 |
| FONDKAPREMONT_MODE | 168099 | 0.683307 |
| FLOORSMIN_MODE | 166833 | 0.678161 |
| FLOORSMIN_MEDI | 166833 | 0.678161 |
| FLOORSMIN_AVG | 166833 | 0.678161 |
| YEARS_BUILD_MEDI | 163466 | 0.664474 |
| YEARS_BUILD_AVG | 163466 | 0.664474 |
| YEARS_BUILD_MODE | 163466 | 0.664474 |
| OWN_CAR_AGE | 162379 | 0.660056 |
| LANDAREA_MEDI | 145984 | 0.593412 |
| LANDAREA_AVG | 145984 | 0.593412 |
| LANDAREA_MODE | 145984 | 0.593412 |
| BASEMENTAREA_MEDI | 143904 | 0.584957 |
| BASEMENTAREA_MODE | 143904 | 0.584957 |
| BASEMENTAREA_AVG | 143904 | 0.584957 |
| EXT_SOURCE_1 | 138763 | 0.564059 |
| NONLIVINGAREA_AVG | 135626 | 0.551307 |
| NONLIVINGAREA_MEDI | 135626 | 0.551307 |
| NONLIVINGAREA_MODE | 135626 | 0.551307 |
| ELEVATORS_AVG | 130969 | 0.532377 |
| ELEVATORS_MEDI | 130969 | 0.532377 |
| ELEVATORS_MODE | 130969 | 0.532377 |
| WALLSMATERIAL_MODE | 124985 | 0.508053 |
| APARTMENTS_AVG | 124774 | 0.507195 |
| APARTMENTS_MODE | 124774 | 0.507195 |
| APARTMENTS_MEDI | 124774 | 0.507195 |
| ENTRANCES_MODE | 123764 | 0.503089 |
| ENTRANCES_AVG | 123764 | 0.503089 |
| ENTRANCES_MEDI | 123764 | 0.503089 |
| LIVINGAREA_MODE | 123428 | 0.501724 |
| LIVINGAREA_AVG | 123428 | 0.501724 |
| LIVINGAREA_MEDI | 123428 | 0.501724 |
| HOUSETYPE_MODE | 123357 | 0.501435 |
| FLOORSMAX_MEDI | 122356 | 0.497366 |
| FLOORSMAX_AVG | 122356 | 0.497366 |
| FLOORSMAX_MODE | 122356 | 0.497366 |
| YEARS_BEGINEXPLUATATION_MEDI | 119930 | 0.487504 |
| YEARS_BEGINEXPLUATATION_MODE | 119930 | 0.487504 |
| YEARS_BEGINEXPLUATATION_AVG | 119930 | 0.487504 |
| TOTALAREA_MODE | 118675 | 0.482403 |
| EMERGENCYSTATE_MODE | 116515 | 0.473623 |
| OCCUPATION_TYPE | 77135 | 0.313547 |
| EXT_SOURCE_3 | 48826 | 0.198473 |
| AMT_REQ_CREDIT_BUREAU_MON | 33199 | 0.134951 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 33199 | 0.134951 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 33199 | 0.134951 |
| AMT_REQ_CREDIT_BUREAU_QRT | 33199 | 0.134951 |
| AMT_REQ_CREDIT_BUREAU_DAY | 33199 | 0.134951 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 33199 | 0.134951 |
| NAME_TYPE_SUITE | 1053 | 0.004280 |
| OBS_30_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| DEF_30_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| OBS_60_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| DEF_60_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| EXT_SOURCE_2 | 521 | 0.002118 |
| AMT_GOODS_PRICE | 229 | 0.000931 |
| AMT_ANNUITY | 10 | 0.000041 |
| CNT_FAM_MEMBERS | 1 | 0.000004 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000004 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| CNT_CHILDREN | 0 | 0.000000 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| WEEKDAY_APPR_PROCESS_START | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
pd_null_filas.head()
| SK_ID_CURR | nulos_filas | target | porcentaje_filas |
|---|---|---|---|
| 412312 | 61 | 0 | 0.504132 |
| 412671 | 61 | 0 | 0.504132 |
| 274127 | 61 | 0 | 0.504132 |
| 255145 | 61 | 0 | 0.504132 |
| 180861 | 61 | 0 | 0.504132 |
No columns are dropped at this stage: the highest share of missing values is about 70% (the COMMONAREA_* columns), which I do not consider high enough to justify removal. Columns with an excessive share of nulls are usually dropped because they carry little usable signal, but here all of them are kept, since even columns with missing values could still be relevant.
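If a cutoff were applied later, the decision could be made explicit with a small helper that lists the columns whose null share exceeds a chosen threshold. This is a minimal sketch on toy data. The function name columns_above_null_share and the 0.5 threshold are hypothetical, and the notebook itself does not apply any such cutoff.

```python
import numpy as np
import pandas as pd


def columns_above_null_share(df: pd.DataFrame, threshold: float) -> pd.Series:
    """Return the null share of every column whose share exceeds `threshold`."""
    null_share = df.isnull().mean()  # fraction of missing values per column
    return null_share[null_share > threshold].sort_values(ascending=False)


# Toy data, invented for illustration only
demo_nulls = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "B": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "C": [1.0, 2.0, 3.0, 4.0],           # no missing values
})
flagged = columns_above_null_share(demo_nulls, threshold=0.5)
print(flagged)
```

On the training set, the same call with pd_loan_train would return the columns to review, leaving the choice of threshold as an explicit, documented parameter.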
Next, we visualize the distribution of the remaining variables in the dataset, both overall and by target class. This helps us understand how the variables are distributed and how they might relate to clients' payment difficulties (1: payment difficulties, 0: no payment difficulties).
To make the plots easier to interpret, the continuous variables are plotted first, followed by the categorical ones.
warnings.filterwarnings('ignore')
# Plot every continuous (float) column, overall and by target class
for i in list(pd_loan_train.columns):
    if (pd_loan_train[i].dtype == float) and (i != 'TARGET'):
        funciones.plot_feature(pd_loan_train, col_name=i, isContinuous=True)
I start with the continuous variables. I look at the shape of the distribution in the histogram, which indicates whether the data are roughly normal, skewed, or multimodal. I also use the boxplot to identify outliers and to gauge the spread of the data through the interquartile range. In addition, I examine how each continuous variable relates to the target variable, looking for clear differences between the classes. Finally, I check the number of null values, since they can affect the quality of the analysis.
The first variable I analyze is the relative population of the region (REGION_POPULATION_RELATIVE). The boxplot shows a slight difference by target class, with a wider interquartile range for clients without payment difficulties. This could suggest that clients in more populated regions have less difficulty making their payments.
The next variable is age, expressed in days (DAYS_BIRTH). The boxplot indicates that younger clients have more difficulty paying: the median age of this group is lower than that of clients without payment difficulties.
The age of the client's car (OWN_CAR_AGE) is also of interest: clients who own older cars appear to have more difficulty paying.
The external source variables (EXT_SOURCE_*, normalized scores from external data sources) are also worth analyzing. Their histograms are roughly bell-shaped but skewed, with the peak toward the right, and the boxplots show different medians by target class. However, because the exact definition of these variables is not available, no definitive conclusions can be drawn.
Another variable worth mentioning is FLOORSMAX_* (normalized information about the building where the client lives). Although it has many null values and outliers, which could undermine the validity of the conclusions, lower values appear to be associated with greater payment difficulties.
I also want to highlight the time elapsed between the client's last phone number change and the application (DAYS_LAST_PHONE_CHANGE). Clients who changed their number more recently tend to have more difficulty making their payments.
Finally, for some variables the distribution of the data makes it hard to draw clear conclusions. These include YEARS_BEGINEXPLUATATION_AVG and NONLIVINGAPARTMENTS_AVG, among others.
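The visual comparisons of boxplot medians described above can be backed by numbers. The sketch below computes the median of each continuous column per target class. The helper name median_by_target and the toy values are assumptions for illustration and do not come from the notebook.

```python
import pandas as pd


def median_by_target(df: pd.DataFrame, columns, target: str = "TARGET") -> pd.DataFrame:
    """One row per variable, one column per target class, values are medians.

    Missing values are skipped by pandas when computing the median.
    """
    return df.groupby(target, observed=True)[columns].median().T


# Toy data, invented for illustration only
demo_medians = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 1],
    "X": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
median_table = median_by_target(demo_medians, ["X"])
print(median_table)
```

Applied to the training set with list_var_continuous, the resulting table makes it easier to rank variables by how far apart the class medians are, rather than judging each plot by eye.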
warnings.filterwarnings('ignore')
# Plot every categorical (non-float) column, overall and by target class
for i in list(pd_loan_train.columns):
    if (pd_loan_train[i].dtype != float) and (i != 'TARGET'):
        funciones.plot_feature(pd_loan_train, col_name=i, isContinuous=False)
To analyze the categorical variables, I start with bar charts, which are useful for visualizing the distribution of the categories and spotting imbalances. I also use stacked bar charts by target class to see how the TARGET classes are distributed across the categories. If some categories show a stronger relationship with the target variable, this may indicate greater predictive power. I also check the number of null values in these variables, since they can affect data quality.
Keep in mind that the target variable is imbalanced: 91.93% of the cases are TARGET = 0 and only 8.07% are TARGET = 1. This imbalance affects how the results are interpreted and must be taken into account in the analysis.
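Because of this imbalance, a category's share of TARGET = 1 is most informative when read next to the number of clients in that category. A minimal sketch of such a table follows, using toy data. The helper name target_rate_by_category is hypothetical, and the sketch assumes TARGET holds only 0 and 1 with no missing values.

```python
import pandas as pd


def target_rate_by_category(df: pd.DataFrame, column: str, target: str = "TARGET") -> pd.DataFrame:
    """Count and share of TARGET = 1 for each category of `column`.

    Rows where `column` is missing are dropped by the groupby.
    """
    y = df[target].astype(int)  # also works when TARGET is a 0/1 category dtype
    grouped = y.groupby(df[column], observed=True)
    table = pd.DataFrame({"n": grouped.size(), "target_rate": grouped.mean()})
    return table.sort_values("target_rate", ascending=False)


# Toy data, invented for illustration only
demo_rates = pd.DataFrame({
    "GENDER": ["F", "F", "F", "M", "M"],
    "TARGET": [0, 0, 1, 1, 0],
})
rate_table = target_rate_by_category(demo_rates, "GENDER")
print(rate_table)
```

Comparing each target_rate with the overall 8.07% baseline, while keeping an eye on n, helps separate genuine differences from rates computed on very few clients.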
The first variable analyzed is the client's gender (CODE_GENDER). Although the differences are small, men apply for fewer loans and are also more likely to have payment problems.
One variable that stands out is the number of children (CNT_CHILDREN). The data show that clients with more children tend to have more difficulty making their payments. However, since most clients have 0 or 1 children, conclusions about clients with many children rest on a small sample and are less representative.
Another relevant variable is the client's income type (NAME_INCOME_TYPE). The results show that unemployed clients and those on maternity leave have more difficulty paying. In contrast, clients who are business owners show no significant payment problems.
As for occupation (OCCUPATION_TYPE), the plots show that clients in low-skill jobs (Low-Skill Laborers) have more difficulty making their payments. This is consistent with job stability and income level influencing payment behavior.
The client's region of residence also appears to influence the probability of payment difficulties. In particular, clients in region 3 have the most payment problems, followed by those in region 2, while clients in region 1 have the fewest.
Another interesting case is the variable for document 2 (FLAG_DOCUMENT_2). The data indicate that clients who provided this document have a higher share of payment difficulties. However, since most clients did not provide it, this conclusion may not be representative of general behavior.
Finally, I did not find particularly useful information in the boolean categorical variables. With only two possible values, their variability is limited, which makes it harder to detect clear patterns or meaningful relationships with the target variable.
It is important to note that the integer columns I have treated as categorical, including the boolean variables, contain no null values.
Next, we address three key aspects of the data analysis: missing values, correlations between the continuous variables, and outliers, with the aim of cleaning and better understanding the data before building the model.
First, as a reminder, these are the variables I have treated as continuous:
# Continuous variables
# ==============================================================================
list_var_continuous
['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
Outliers can be replaced with the mean or the median, or capped at extreme values such as the mean ± 3 standard deviations.
Before treating outliers, it is important to analyze their relationship with the target variable and understand their context, since they may be either relevant cases or measurement errors that affect the model's predictions.
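The mean ± 3 standard deviations rule mentioned above can be sketched as follows. This is not the code of funciones.get_deviation_of_mean_perc, whose source is not shown here. The helper names outlier_summary and cap_outliers and the toy data are hypothetical, and the sketch assumes TARGET holds only 0 and 1.

```python
import pandas as pd


def outlier_bounds(x: pd.Series, multiplier: float = 3.0):
    """Lower and upper limits at mean -/+ multiplier * standard deviation."""
    mean, std = x.mean(), x.std()
    return mean - multiplier * std, mean + multiplier * std


def outlier_summary(df: pd.DataFrame, column: str, target: str = "TARGET",
                    multiplier: float = 3.0) -> dict:
    """Count the outliers of `column` and the share of TARGET = 1 among them."""
    lower, upper = outlier_bounds(df[column], multiplier)
    is_outlier = (df[column] < lower) | (df[column] > upper)
    return {
        "n_outliers": int(is_outlier.sum()),
        "share_outliers": float(is_outlier.mean()),
        "target_rate_outliers": float(df.loc[is_outlier, target].astype(int).mean()),
    }


def cap_outliers(x: pd.Series, multiplier: float = 3.0) -> pd.Series:
    """Cap values outside the limits at the limits (one possible treatment)."""
    lower, upper = outlier_bounds(x, multiplier)
    return x.clip(lower=lower, upper=upper)


# Toy data, invented for illustration: 19 typical values and one extreme value
demo_outliers = pd.DataFrame({
    "X": [10.0] * 19 + [1000.0],
    "TARGET": [0] * 19 + [1],
})
summary = outlier_summary(demo_outliers, "X")
capped = cap_outliers(demo_outliers["X"])
print(summary, capped.max())
```

Comparing target_rate_outliers with the overall target rate is one way to judge whether the extreme values carry signal before deciding to cap or replace them.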
funciones.get_deviation_of_mean_perc(pd_loan_train, list_var_continuous, target='TARGET', multiplier=3)
| | 0.0 | 1.0 | variable | sum_outlier_values | porcentaje_sum_null_values |
|---|---|---|---|---|---|
| 0 | 0.000000 | 1.000000 | AMT_INCOME_TOTAL | 2193 | 0.008914 |
| 1 | 0.943000 | 0.057000 | AMT_INCOME_TOTAL | 2193 | 0.008914 |
| 2 | 0.000000 | 1.000000 | AMT_CREDIT | 2593 | 0.010540 |
| 3 | 0.957964 | 0.042036 | AMT_CREDIT | 2593 | 0.010540 |
| 4 | 0.000000 | 1.000000 | AMT_ANNUITY | 2353 | 0.009565 |
| 5 | 0.962176 | 0.037824 | AMT_ANNUITY | 2353 | 0.009565 |
| 6 | 0.000000 | 1.000000 | AMT_GOODS_PRICE | 3322 | 0.013504 |
| 7 | 0.961770 | 0.038230 | AMT_GOODS_PRICE | 3322 | 0.013504 |
| 8 | 0.000000 | 1.000000 | REGION_POPULATION_RELATIVE | 6745 | 0.027418 |
| 9 | 0.959377 | 0.040623 | REGION_POPULATION_RELATIVE | 6745 | 0.027418 |
| 10 | 0.000000 | 1.000000 | DAYS_REGISTRATION | 604 | 0.002455 |
| 11 | 0.953642 | 0.046358 | DAYS_REGISTRATION | 604 | 0.002455 |
| 12 | 0.000000 | 1.000000 | OWN_CAR_AGE | 2689 | 0.010931 |
| 13 | 0.923763 | 0.076237 | OWN_CAR_AGE | 2689 | 0.010931 |
| 14 | 0.000000 | 1.000000 | CNT_FAM_MEMBERS | 3223 | 0.013101 |
| 15 | 0.898852 | 0.101148 | CNT_FAM_MEMBERS | 3223 | 0.013101 |
| 16 | 0.000000 | 1.000000 | HOUR_APPR_PROCESS_START | 495 | 0.002012 |
| 17 | 0.896970 | 0.103030 | HOUR_APPR_PROCESS_START | 495 | 0.002012 |
| 18 | 0.000000 | 1.000000 | APARTMENTS_AVG | 2389 | 0.009711 |
| 19 | 0.947677 | 0.052323 | APARTMENTS_AVG | 2389 | 0.009711 |
| 20 | 0.000000 | 1.000000 | BASEMENTAREA_AVG | 1577 | 0.006410 |
| 21 | 0.946734 | 0.053266 | BASEMENTAREA_AVG | 1577 | 0.006410 |
| 22 | 0.000000 | 1.000000 | YEARS_BEGINEXPLUATATION_AVG | 544 | 0.002211 |
| 23 | 0.915441 | 0.084559 | YEARS_BEGINEXPLUATATION_AVG | 544 | 0.002211 |
| 24 | 0.000000 | 1.000000 | YEARS_BUILD_AVG | 970 | 0.003943 |
| 25 | 0.927835 | 0.072165 | YEARS_BUILD_AVG | 970 | 0.003943 |
| 26 | 0.000000 | 1.000000 | COMMONAREA_AVG | 1355 | 0.005508 |
| 27 | 0.949077 | 0.050923 | COMMONAREA_AVG | 1355 | 0.005508 |
| 28 | 0.000000 | 1.000000 | ELEVATORS_AVG | 1961 | 0.007971 |
| 29 | 0.952065 | 0.047935 | ELEVATORS_AVG | 1961 | 0.007971 |
| 30 | 0.000000 | 1.000000 | ENTRANCES_AVG | 1763 | 0.007166 |
| 31 | 0.936472 | 0.063528 | ENTRANCES_AVG | 1763 | 0.007166 |
| 32 | 0.000000 | 1.000000 | FLOORSMAX_AVG | 2095 | 0.008516 |
| 33 | 0.958473 | 0.041527 | FLOORSMAX_AVG | 2095 | 0.008516 |
| 34 | 0.000000 | 1.000000 | FLOORSMIN_AVG | 464 | 0.001886 |
| 35 | 0.961207 | 0.038793 | FLOORSMIN_AVG | 464 | 0.001886 |
| 36 | 0.000000 | 1.000000 | LANDAREA_AVG | 1662 | 0.006756 |
| 37 | 0.932611 | 0.067389 | LANDAREA_AVG | 1662 | 0.006756 |
| 38 | 0.000000 | 1.000000 | LIVINGAPARTMENTS_AVG | 1402 | 0.005699 |
| 39 | 0.946505 | 0.053495 | LIVINGAPARTMENTS_AVG | 1402 | 0.005699 |
| 40 | 0.000000 | 1.000000 | LIVINGAREA_AVG | 2565 | 0.010426 |
| 41 | 0.946199 | 0.053801 | LIVINGAREA_AVG | 2565 | 0.010426 |
| 42 | 0.000000 | 1.000000 | NONLIVINGAPARTMENTS_AVG | 568 | 0.002309 |
| 43 | 0.926056 | 0.073944 | NONLIVINGAPARTMENTS_AVG | 568 | 0.002309 |
| 44 | 0.000000 | 1.000000 | NONLIVINGAREA_AVG | 1943 | 0.007898 |
| 45 | 0.951107 | 0.048893 | NONLIVINGAREA_AVG | 1943 | 0.007898 |
| 46 | 0.000000 | 1.000000 | APARTMENTS_MODE | 2405 | 0.009776 |
| 47 | 0.946778 | 0.053222 | APARTMENTS_MODE | 2405 | 0.009776 |
| 48 | 0.000000 | 1.000000 | BASEMENTAREA_MODE | 1637 | 0.006654 |
| 49 | 0.944411 | 0.055589 | BASEMENTAREA_MODE | 1637 | 0.006654 |
| 50 | 0.000000 | 1.000000 | YEARS_BEGINEXPLUATATION_MODE | 533 | 0.002167 |
| 51 | 0.913696 | 0.086304 | YEARS_BEGINEXPLUATATION_MODE | 533 | 0.002167 |
| 52 | 0.000000 | 1.000000 | YEARS_BUILD_MODE | 984 | 0.004000 |
| 53 | 0.927846 | 0.072154 | YEARS_BUILD_MODE | 984 | 0.004000 |
| 54 | 0.000000 | 1.000000 | COMMONAREA_MODE | 1351 | 0.005492 |
| 55 | 0.943005 | 0.056995 | COMMONAREA_MODE | 1351 | 0.005492 |
| 56 | 0.000000 | 1.000000 | ELEVATORS_MODE | 2701 | 0.010979 |
| 57 | 0.949648 | 0.050352 | ELEVATORS_MODE | 2701 | 0.010979 |
| 58 | 0.000000 | 1.000000 | ENTRANCES_MODE | 2099 | 0.008532 |
| 59 | 0.939495 | 0.060505 | ENTRANCES_MODE | 2099 | 0.008532 |
| 60 | 0.000000 | 1.000000 | FLOORSMAX_MODE | 2111 | 0.008581 |
| 61 | 0.958787 | 0.041213 | FLOORSMAX_MODE | 2111 | 0.008581 |
| 62 | 0.000000 | 1.000000 | FLOORSMIN_MODE | 381 | 0.001549 |
| 63 | 0.963255 | 0.036745 | FLOORSMIN_MODE | 381 | 0.001549 |
| 64 | 0.000000 | 1.000000 | LANDAREA_MODE | 1705 | 0.006931 |
| 65 | 0.933724 | 0.066276 | LANDAREA_MODE | 1705 | 0.006931 |
| 66 | 0.000000 | 1.000000 | LIVINGAPARTMENTS_MODE | 1442 | 0.005862 |
| 67 | 0.943828 | 0.056172 | LIVINGAPARTMENTS_MODE | 1442 | 0.005862 |
| 68 | 0.000000 | 1.000000 | LIVINGAREA_MODE | 2698 | 0.010967 |
| 69 | 0.945515 | 0.054485 | LIVINGAREA_MODE | 2698 | 0.010967 |
| 70 | 0.000000 | 1.000000 | NONLIVINGAPARTMENTS_MODE | 535 | 0.002175 |
| 71 | 0.917757 | 0.082243 | NONLIVINGAPARTMENTS_MODE | 535 | 0.002175 |
| 72 | 0.000000 | 1.000000 | NONLIVINGAREA_MODE | 1978 | 0.008040 |
| 73 | 0.952477 | 0.047523 | NONLIVINGAREA_MODE | 1978 | 0.008040 |
| 74 | 0.000000 | 1.000000 | APARTMENTS_MEDI | 2429 | 0.009874 |
| 75 | 0.947303 | 0.052697 | APARTMENTS_MEDI | 2429 | 0.009874 |
| 76 | 0.000000 | 1.000000 | BASEMENTAREA_MEDI | 1579 | 0.006418 |
| 77 | 0.946168 | 0.053832 | BASEMENTAREA_MEDI | 1579 | 0.006418 |
| 78 | 0.000000 | 1.000000 | YEARS_BEGINEXPLUATATION_MEDI | 514 | 0.002089 |
| 79 | 0.912451 | 0.087549 | YEARS_BEGINEXPLUATATION_MEDI | 514 | 0.002089 |
| 80 | 0.000000 | 1.000000 | YEARS_BUILD_MEDI | 980 | 0.003984 |
| 81 | 0.928571 | 0.071429 | YEARS_BUILD_MEDI | 980 | 0.003984 |
| 82 | 0.000000 | 1.000000 | COMMONAREA_MEDI | 1371 | 0.005573 |
| 83 | 0.947484 | 0.052516 | COMMONAREA_MEDI | 1371 | 0.005573 |
| 84 | 0.000000 | 1.000000 | ELEVATORS_MEDI | 1954 | 0.007943 |
| 85 | 0.952405 | 0.047595 | ELEVATORS_MEDI | 1954 | 0.007943 |
| 86 | 0.000000 | 1.000000 | ENTRANCES_MEDI | 1770 | 0.007195 |
| 87 | 0.936158 | 0.063842 | ENTRANCES_MEDI | 1770 | 0.007195 |
| 88 | 0.000000 | 1.000000 | FLOORSMAX_MEDI | 2188 | 0.008894 |
| 89 | 0.957952 | 0.042048 | FLOORSMAX_MEDI | 2188 | 0.008894 |
| 90 | 0.000000 | 1.000000 | FLOORSMIN_MEDI | 439 | 0.001784 |
| 91 | 0.961276 | 0.038724 | FLOORSMIN_MEDI | 439 | 0.001784 |
| 92 | 0.000000 | 1.000000 | LANDAREA_MEDI | 1705 | 0.006931 |
| 93 | 0.935484 | 0.064516 | LANDAREA_MEDI | 1705 | 0.006931 |
| 94 | 0.000000 | 1.000000 | LIVINGAPARTMENTS_MEDI | 1417 | 0.005760 |
| 95 | 0.944954 | 0.055046 | LIVINGAPARTMENTS_MEDI | 1417 | 0.005760 |
| 96 | 0.000000 | 1.000000 | LIVINGAREA_MEDI | 2586 | 0.010512 |
| 97 | 0.947409 | 0.052591 | LIVINGAREA_MEDI | 2586 | 0.010512 |
| 98 | 0.000000 | 1.000000 | NONLIVINGAPARTMENTS_MEDI | 567 | 0.002305 |
| 99 | 0.924162 | 0.075838 | NONLIVINGAPARTMENTS_MEDI | 567 | 0.002305 |
| 100 | 0.000000 | 1.000000 | NONLIVINGAREA_MEDI | 1960 | 0.007967 |
| 101 | 0.951531 | 0.048469 | NONLIVINGAREA_MEDI | 1960 | 0.007967 |
| 102 | 0.000000 | 1.000000 | TOTALAREA_MODE | 2676 | 0.010878 |
| 103 | 0.952541 | 0.047459 | TOTALAREA_MODE | 2676 | 0.010878 |
| 104 | 0.000000 | 1.000000 | OBS_30_CNT_SOCIAL_CIRCLE | 4891 | 0.019881 |
| 105 | 0.912492 | 0.087508 | OBS_30_CNT_SOCIAL_CIRCLE | 4891 | 0.019881 |
| 106 | 0.000000 | 1.000000 | DEF_30_CNT_SOCIAL_CIRCLE | 5453 | 0.022166 |
| 107 | 0.877315 | 0.122685 | DEF_30_CNT_SOCIAL_CIRCLE | 5453 | 0.022166 |
| 108 | 0.000000 | 1.000000 | OBS_60_CNT_SOCIAL_CIRCLE | 4750 | 0.019308 |
| 109 | 0.912632 | 0.087368 | OBS_60_CNT_SOCIAL_CIRCLE | 4750 | 0.019308 |
| 110 | 0.000000 | 1.000000 | DEF_60_CNT_SOCIAL_CIRCLE | 3146 | 0.012788 |
| 111 | 0.870947 | 0.129053 | DEF_60_CNT_SOCIAL_CIRCLE | 3146 | 0.012788 |
| 112 | 0.000000 | 1.000000 | DAYS_LAST_PHONE_CHANGE | 505 | 0.002053 |
| 113 | 0.960396 | 0.039604 | DAYS_LAST_PHONE_CHANGE | 505 | 0.002053 |
| 114 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_HOUR | 1323 | 0.005378 |
| 115 | 0.919123 | 0.080877 | AMT_REQ_CREDIT_BUREAU_HOUR | 1323 | 0.005378 |
| 116 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_DAY | 1198 | 0.004870 |
| 117 | 0.906511 | 0.093489 | AMT_REQ_CREDIT_BUREAU_DAY | 1198 | 0.004870 |
| 118 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_WEEK | 6848 | 0.027836 |
| 119 | 0.924065 | 0.075935 | AMT_REQ_CREDIT_BUREAU_WEEK | 6848 | 0.027836 |
| 120 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_MON | 2596 | 0.010553 |
| 121 | 0.943374 | 0.056626 | AMT_REQ_CREDIT_BUREAU_MON | 2596 | 0.010553 |
| 122 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_QRT | 1842 | 0.007488 |
| 123 | 0.915309 | 0.084691 | AMT_REQ_CREDIT_BUREAU_QRT | 1842 | 0.007488 |
| 124 | 0.000000 | 1.000000 | AMT_REQ_CREDIT_BUREAU_YEAR | 2661 | 0.010817 |
| 125 | 0.903796 | 0.096204 | AMT_REQ_CREDIT_BUREAU_YEAR | 2661 | 0.010817 |
After the exploratory analysis, I have decided not to replace the outliers in this first iteration, since their impact on the model should be evaluated first. This decision is also motivated by the large number of outliers observed: in some variables they exceed 2% of the data. Note that the proportion of the target variable within each variable will change if the outliers are left out. Once the model is built, I can iterate with different treatment methods to evaluate whether they improve performance.
This section analyzes the correlations between the continuous variables using the Pearson correlation matrix, which measures the strength and direction of the linear relationship between two numerical variables. Pearson values range from -1 to 1, where 1 indicates a perfect positive relationship, -1 a perfect negative relationship, and 0 the absence of a linear relationship.
Note that the correlation of each variable with itself (the diagonal of the matrix) is set to 0 to avoid visual distraction and to focus the analysis on relationships between different variables.
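The idea can be sketched with plain pandas and NumPy. This is only an approximation of what a helper such as `funciones.get_corr_matrix` might do internally (its implementation is not shown here), and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): b is an exact linear function of a,
# and c decreases linearly as a increases.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
})

corr = df.corr(method="pearson")

# Zero out the diagonal so that the trivial self-correlation of 1 does not
# dominate the colour scale when the matrix is drawn as a heatmap.
values = corr.to_numpy(copy=True)
np.fill_diagonal(values, 0.0)
corr_no_diag = pd.DataFrame(values, index=corr.index, columns=corr.columns)

print(corr_no_diag.round(3))
```

The resulting frame could then be passed to a plotting function such as `sns.heatmap`.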
funciones.get_corr_matrix(dataset = pd_loan_train[list_var_continuous],
metodo='pearson', size_figure=[10,8])
0
In the correlation matrix, several Pearson correlations are very close to 1, which means that some variables are almost perfectly correlated. This may be a sign that some variables are redundant or derived from others, which could cause multicollinearity problems in statistical models.
An interesting pattern is that the variables reporting the average, median and mode of the same characteristic (suffixes `_AVG`, `_MEDI` and `_MODE`) show correlations close to 1. This suggests that these variables carry practically the same information.
These variables will be reviewed later and, if necessary, those that are identical or highly correlated will be removed so that they do not affect the model's results.
For tree-based algorithms such as XGBoost and Random Forest, multicollinearity is not a major problem, since these models do not need the variables to be independent of each other to make predictions. They handle correlated variables without a noticeable loss in predictive performance, although the importance assigned to each variable may end up split among the correlated ones.
In linear models such as GLMs (Generalized Linear Models), however, multicollinearity can affect the stability and interpretation of the coefficients. High correlation between variables inflates the standard errors, which makes the estimates imprecise. In these cases, it will be important to remove or reduce collinearity before training the model.
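One common way to quantify how much collinearity inflates the variance of a linear coefficient is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing variable j on all the other variables. The sketch below is not part of this notebook's pipeline: the function name, the toy data and the idea of using VIF here are assumptions for illustration, and it expects a DataFrame without missing values.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = df.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary least-squares regression with an intercept column.
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs[name] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Toy data (made up for illustration): x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.05 * rng.normal(size=500),
    "x3": rng.normal(size=500),
})

vif = variance_inflation_factors(toy)
print(vif.round(2))
```

In this toy example the two nearly identical columns get a very large VIF while the independent one stays close to 1. Values above roughly 5 to 10 are often read as a warning sign, but that cut-off is a rule of thumb rather than a strict criterion.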
# Pearson correlation between continuous variables
# ==============================================================================
corr = pd_loan_train[list_var_continuous].corr('pearson')
new_corr = corr.abs()
# Keep only the strict lower triangle so that each pair of variables appears once
new_corr.loc[:,:] = np.tril(new_corr, k=-1)
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
# Pairs with an absolute correlation above 0.6
new_corr[new_corr['correlation']>0.6]
| | level_0 | level_1 | correlation |
|---|---|---|---|
| 4198 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998507 |
| 3192 | YEARS_BUILD_MEDI | YEARS_BUILD_AVG | 0.998433 |
| 3542 | FLOORSMIN_MEDI | FLOORSMIN_AVG | 0.997350 |
| 3472 | FLOORSMAX_MEDI | FLOORSMAX_AVG | 0.996987 |
| 3402 | ENTRANCES_MEDI | ENTRANCES_AVG | 0.996868 |
| 3332 | ELEVATORS_MEDI | ELEVATORS_AVG | 0.996301 |
| 3262 | COMMONAREA_MEDI | COMMONAREA_AVG | 0.995773 |
| 3752 | LIVINGAREA_MEDI | LIVINGAREA_AVG | 0.995569 |
| 3122 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.994784 |
| 2982 | APARTMENTS_MEDI | APARTMENTS_AVG | 0.994586 |
| 3052 | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | 0.994424 |
| 3822 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_AVG | 0.994315 |
| 3682 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.992994 |
| 3892 | NONLIVINGAREA_MEDI | NONLIVINGAREA_AVG | 0.991681 |
| 3612 | LANDAREA_MEDI | LANDAREA_AVG | 0.991270 |
| 3206 | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | 0.989588 |
| 2226 | YEARS_BUILD_MODE | YEARS_BUILD_AVG | 0.989509 |
| 3486 | FLOORSMAX_MEDI | FLOORSMAX_MODE | 0.988376 |
| 3556 | FLOORSMIN_MEDI | FLOORSMIN_MODE | 0.988346 |
| 208 | AMT_GOODS_PRICE | AMT_CREDIT | 0.986961 |
| 2576 | FLOORSMIN_MODE | FLOORSMIN_AVG | 0.985898 |
| 2506 | FLOORSMAX_MODE | FLOORSMAX_AVG | 0.985728 |
| 3346 | ELEVATORS_MEDI | ELEVATORS_MODE | 0.983063 |
| 3416 | ENTRANCES_MEDI | ENTRANCES_MODE | 0.980799 |
| 3626 | LANDAREA_MEDI | LANDAREA_MODE | 0.980337 |
| 3276 | COMMONAREA_MEDI | COMMONAREA_MODE | 0.980305 |
| 3836 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_MODE | 0.979433 |
| 2366 | ELEVATORS_MODE | ELEVATORS_AVG | 0.979193 |
| 2436 | ENTRANCES_MODE | ENTRANCES_AVG | 0.977770 |
| 3066 | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | 0.977537 |
| 2296 | COMMONAREA_MODE | COMMONAREA_AVG | 0.977306 |
| 2996 | APARTMENTS_MEDI | APARTMENTS_MODE | 0.977249 |
| 3906 | NONLIVINGAREA_MEDI | NONLIVINGAREA_MODE | 0.975850 |
| 3696 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.975408 |
| 3766 | LIVINGAREA_MEDI | LIVINGAREA_MODE | 0.974808 |
| 2856 | NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_AVG | 0.973941 |
| 2156 | YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_AVG | 0.973210 |
| 2086 | BASEMENTAREA_MODE | BASEMENTAREA_AVG | 0.973177 |
| 2646 | LANDAREA_MODE | LANDAREA_AVG | 0.973008 |
| 2016 | APARTMENTS_MODE | APARTMENTS_AVG | 0.972709 |
| 2786 | LIVINGAREA_MODE | LIVINGAREA_AVG | 0.972126 |
| 2716 | LIVINGAPARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.969061 |
| 2926 | NONLIVINGAREA_MODE | NONLIVINGAREA_AVG | 0.967384 |
| 3136 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | 0.966340 |
| 1740 | LIVINGAPARTMENTS_AVG | APARTMENTS_AVG | 0.942701 |
| 3700 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MEDI | 0.941122 |
| 3672 | LIVINGAPARTMENTS_MEDI | APARTMENTS_AVG | 0.940545 |
| 2720 | LIVINGAPARTMENTS_MODE | APARTMENTS_MODE | 0.937027 |
| 2992 | APARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.933402 |
| 3006 | APARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.931129 |
| 2706 | LIVINGAPARTMENTS_MODE | APARTMENTS_AVG | 0.929157 |
| 3959 | TOTALAREA_MODE | LIVINGAREA_AVG | 0.924305 |
| 3987 | TOTALAREA_MODE | LIVINGAREA_MEDI | 0.918768 |
| 3769 | LIVINGAREA_MEDI | APARTMENTS_MEDI | 0.914772 |
| 3686 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MODE | 0.913641 |
| 1809 | LIVINGAREA_AVG | APARTMENTS_AVG | 0.912009 |
| 2993 | APARTMENTS_MEDI | LIVINGAREA_AVG | 0.911236 |
| 3741 | LIVINGAREA_MEDI | APARTMENTS_AVG | 0.910805 |
| 2789 | LIVINGAREA_MODE | APARTMENTS_MODE | 0.908617 |
| 2026 | APARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.906482 |
| 3973 | TOTALAREA_MODE | LIVINGAREA_MODE | 0.898018 |
| 3007 | APARTMENTS_MEDI | LIVINGAREA_MODE | 0.894758 |
| 3755 | LIVINGAREA_MEDI | APARTMENTS_MODE | 0.892956 |
| 2775 | LIVINGAREA_MODE | APARTMENTS_AVG | 0.891694 |
| 3948 | TOTALAREA_MODE | APARTMENTS_AVG | 0.889941 |
| 2027 | APARTMENTS_MODE | LIVINGAREA_AVG | 0.889435 |
| 3976 | TOTALAREA_MODE | APARTMENTS_MEDI | 0.884403 |
| 3779 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MEDI | 0.880944 |
| 3683 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_AVG | 0.879678 |
| 1819 | LIVINGAREA_AVG | LIVINGAPARTMENTS_AVG | 0.876364 |
| 2799 | LIVINGAREA_MODE | LIVINGAPARTMENTS_MODE | 0.876074 |
| 3751 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.873789 |
| 3765 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.871375 |
| 2717 | LIVINGAPARTMENTS_MODE | LIVINGAREA_AVG | 0.870303 |
| 3774 | LIVINGAREA_MEDI | ELEVATORS_MEDI | 0.868221 |
| 1814 | LIVINGAREA_AVG | ELEVATORS_AVG | 0.867826 |
| 3338 | ELEVATORS_MEDI | LIVINGAREA_AVG | 0.865688 |
| 3746 | LIVINGAREA_MEDI | ELEVATORS_AVG | 0.865651 |
| 4268 | DEF_60_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | 0.862182 |
| 3962 | TOTALAREA_MODE | APARTMENTS_MODE | 0.860982 |
| 3760 | LIVINGAREA_MEDI | ELEVATORS_MODE | 0.855928 |
| 2794 | LIVINGAREA_MODE | ELEVATORS_MODE | 0.855533 |
| 3697 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MODE | 0.854546 |
| 2372 | ELEVATORS_MODE | LIVINGAREA_AVG | 0.852704 |
| 2785 | LIVINGAREA_MODE | LIVINGAPARTMENTS_AVG | 0.848094 |
| 3953 | TOTALAREA_MODE | ELEVATORS_AVG | 0.844223 |
| 3958 | TOTALAREA_MODE | LIVINGAPARTMENTS_AVG | 0.843714 |
| 3986 | TOTALAREA_MODE | LIVINGAPARTMENTS_MEDI | 0.842919 |
| 3352 | ELEVATORS_MEDI | LIVINGAREA_MODE | 0.840677 |
| 2780 | LIVINGAREA_MODE | ELEVATORS_AVG | 0.838720 |
| 3981 | TOTALAREA_MODE | ELEVATORS_MEDI | 0.837677 |
| 3355 | ELEVATORS_MEDI | APARTMENTS_MEDI | 0.836495 |
| 1395 | ELEVATORS_AVG | APARTMENTS_AVG | 0.835770 |
| 3327 | ELEVATORS_MEDI | APARTMENTS_AVG | 0.833806 |
| 2987 | APARTMENTS_MEDI | ELEVATORS_AVG | 0.833718 |
| 3972 | TOTALAREA_MODE | LIVINGAPARTMENTS_MODE | 0.830639 |
| 2375 | ELEVATORS_MODE | APARTMENTS_MODE | 0.825292 |
| 3001 | APARTMENTS_MEDI | ELEVATORS_MODE | 0.824968 |
| 2361 | ELEVATORS_MODE | APARTMENTS_AVG | 0.821398 |
| 3967 | TOTALAREA_MODE | ELEVATORS_MODE | 0.820175 |
| 3705 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MEDI | 0.812395 |
| 3677 | LIVINGAPARTMENTS_MEDI | ELEVATORS_AVG | 0.811250 |
| 1745 | LIVINGAPARTMENTS_AVG | ELEVATORS_AVG | 0.809605 |
| 3341 | ELEVATORS_MEDI | APARTMENTS_MODE | 0.808021 |
| 3337 | ELEVATORS_MEDI | LIVINGAPARTMENTS_AVG | 0.806919 |
| 2725 | LIVINGAPARTMENTS_MODE | ELEVATORS_MODE | 0.806771 |
| 2021 | APARTMENTS_MODE | ELEVATORS_AVG | 0.805244 |
| 3691 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MODE | 0.798788 |
| 3351 | ELEVATORS_MEDI | LIVINGAPARTMENTS_MODE | 0.797218 |
| 2711 | LIVINGAPARTMENTS_MODE | ELEVATORS_AVG | 0.795883 |
| 2371 | ELEVATORS_MODE | LIVINGAPARTMENTS_AVG | 0.792846 |
| 209 | AMT_GOODS_PRICE | AMT_ANNUITY | 0.774818 |
| 139 | AMT_ANNUITY | AMT_CREDIT | 0.769895 |
| 1609 | FLOORSMIN_AVG | FLOORSMAX_AVG | 0.744240 |
| 3569 | FLOORSMIN_MEDI | FLOORSMAX_MEDI | 0.742187 |
| 3473 | FLOORSMAX_MEDI | FLOORSMIN_AVG | 0.741814 |
| 3541 | FLOORSMIN_MEDI | FLOORSMAX_AVG | 0.741532 |
| 3555 | FLOORSMIN_MEDI | FLOORSMAX_MODE | 0.732262 |
| 2507 | FLOORSMAX_MODE | FLOORSMIN_AVG | 0.731580 |
| 2589 | FLOORSMIN_MODE | FLOORSMAX_MODE | 0.728786 |
| 3487 | FLOORSMAX_MEDI | FLOORSMIN_MODE | 0.724805 |
| 2575 | FLOORSMIN_MODE | FLOORSMAX_AVG | 0.723868 |
| 1810 | LIVINGAREA_AVG | BASEMENTAREA_AVG | 0.693555 |
| 3062 | BASEMENTAREA_MEDI | LIVINGAREA_AVG | 0.693067 |
| 3770 | LIVINGAREA_MEDI | BASEMENTAREA_MEDI | 0.691676 |
| 2790 | LIVINGAREA_MODE | BASEMENTAREA_MODE | 0.690143 |
| 3742 | LIVINGAREA_MEDI | BASEMENTAREA_AVG | 0.689844 |
| 3079 | BASEMENTAREA_MEDI | APARTMENTS_MEDI | 0.681423 |
| 3076 | BASEMENTAREA_MEDI | LIVINGAREA_MODE | 0.680609 |
| 1119 | BASEMENTAREA_AVG | APARTMENTS_AVG | 0.680266 |
| 1538 | FLOORSMAX_AVG | ELEVATORS_AVG | 0.680106 |
| 3051 | BASEMENTAREA_MEDI | APARTMENTS_AVG | 0.679620 |
| 2983 | APARTMENTS_MEDI | BASEMENTAREA_AVG | 0.679359 |
| 2099 | BASEMENTAREA_MODE | APARTMENTS_MODE | 0.678690 |
| 3470 | FLOORSMAX_MEDI | ELEVATORS_AVG | 0.677876 |
| 2776 | LIVINGAREA_MODE | BASEMENTAREA_AVG | 0.677440 |
| 3334 | ELEVATORS_MEDI | FLOORSMAX_AVG | 0.676187 |
| 3498 | FLOORSMAX_MEDI | ELEVATORS_MEDI | 0.675717 |
| 3949 | TOTALAREA_MODE | BASEMENTAREA_AVG | 0.674405 |
| 2096 | BASEMENTAREA_MODE | LIVINGAREA_AVG | 0.674000 |
| 3756 | LIVINGAREA_MEDI | BASEMENTAREA_MODE | 0.673202 |
| 3977 | TOTALAREA_MODE | BASEMENTAREA_MEDI | 0.671340 |
| 2504 | FLOORSMAX_MODE | ELEVATORS_AVG | 0.671115 |
| 3065 | BASEMENTAREA_MEDI | APARTMENTS_MODE | 0.669412 |
| 3348 | ELEVATORS_MEDI | FLOORSMAX_MODE | 0.669101 |
| 2017 | APARTMENTS_MODE | BASEMENTAREA_AVG | 0.666762 |
| 2997 | APARTMENTS_MEDI | BASEMENTAREA_MODE | 0.664287 |
| 2085 | BASEMENTAREA_MODE | APARTMENTS_AVG | 0.661368 |
| 2518 | FLOORSMAX_MODE | ELEVATORS_MODE | 0.661015 |
| 2368 | ELEVATORS_MODE | FLOORSMAX_AVG | 0.656377 |
| 3484 | FLOORSMAX_MEDI | ELEVATORS_MODE | 0.655841 |
| 2721 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_MODE | 0.654791 |
| 2445 | ENTRANCES_MODE | BASEMENTAREA_MODE | 0.654728 |
| 2091 | BASEMENTAREA_MODE | ENTRANCES_AVG | 0.654112 |
| 3057 | BASEMENTAREA_MEDI | ENTRANCES_AVG | 0.653466 |
| 3411 | ENTRANCES_MEDI | BASEMENTAREA_MODE | 0.653085 |
| 3075 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.652850 |
| 3701 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MEDI | 0.652410 |
| 3425 | ENTRANCES_MEDI | BASEMENTAREA_MEDI | 0.651908 |
| 1465 | ENTRANCES_AVG | BASEMENTAREA_AVG | 0.651603 |
| 3673 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_AVG | 0.650068 |
| 3963 | TOTALAREA_MODE | BASEMENTAREA_MODE | 0.649858 |
| 2707 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_AVG | 0.648827 |
| 3397 | ENTRANCES_MEDI | BASEMENTAREA_AVG | 0.647414 |
| 1741 | LIVINGAPARTMENTS_AVG | BASEMENTAREA_AVG | 0.647356 |
| 3061 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.647106 |
| 3955 | TOTALAREA_MODE | FLOORSMAX_AVG | 0.632127 |
| 3071 | BASEMENTAREA_MEDI | ENTRANCES_MODE | 0.631743 |
| 1816 | LIVINGAREA_AVG | FLOORSMAX_AVG | 0.630896 |
| 3687 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MODE | 0.629944 |
| 3983 | TOTALAREA_MODE | FLOORSMAX_MEDI | 0.629758 |
| 3476 | FLOORSMAX_MEDI | LIVINGAREA_AVG | 0.629017 |
| 3748 | LIVINGAREA_MEDI | FLOORSMAX_AVG | 0.627462 |
| 3776 | LIVINGAREA_MEDI | FLOORSMAX_MEDI | 0.626806 |
| 2431 | ENTRANCES_MODE | BASEMENTAREA_AVG | 0.626791 |
| 2510 | FLOORSMAX_MODE | LIVINGAREA_AVG | 0.626321 |
| 3969 | TOTALAREA_MODE | FLOORSMAX_MODE | 0.624911 |
| 2095 | BASEMENTAREA_MODE | LIVINGAPARTMENTS_AVG | 0.624439 |
| 3762 | LIVINGAREA_MEDI | FLOORSMAX_MODE | 0.624430 |
| 2795 | LIVINGAREA_MODE | ENTRANCES_MODE | 0.622706 |
| 2781 | LIVINGAREA_MODE | ENTRANCES_AVG | 0.622657 |
| 3421 | ENTRANCES_MEDI | LIVINGAREA_MODE | 0.622088 |
| 3747 | LIVINGAREA_MEDI | ENTRANCES_AVG | 0.619088 |
| 3775 | LIVINGAREA_MEDI | ENTRANCES_MEDI | 0.618682 |
| 1815 | LIVINGAREA_AVG | ENTRANCES_AVG | 0.618659 |
| 1533 | FLOORSMAX_AVG | APARTMENTS_AVG | 0.617992 |
| 419 | DAYS_EMPLOYED | DAYS_BIRTH | 0.615974 |
| 3465 | FLOORSMAX_MEDI | APARTMENTS_AVG | 0.615943 |
| 2444 | ENTRANCES_MODE | APARTMENTS_MODE | 0.615803 |
| 3407 | ENTRANCES_MEDI | LIVINGAREA_AVG | 0.614800 |
| 2989 | APARTMENTS_MEDI | FLOORSMAX_AVG | 0.614602 |
| 3493 | FLOORSMAX_MEDI | APARTMENTS_MEDI | 0.613842 |
| 2499 | FLOORSMAX_MODE | APARTMENTS_AVG | 0.613690 |
| 3410 | ENTRANCES_MEDI | APARTMENTS_MODE | 0.612217 |
| 2022 | APARTMENTS_MODE | ENTRANCES_AVG | 0.611973 |
| 3003 | APARTMENTS_MEDI | FLOORSMAX_MODE | 0.611940 |
| 2988 | APARTMENTS_MEDI | ENTRANCES_AVG | 0.611150 |
| 3424 | ENTRANCES_MEDI | APARTMENTS_MEDI | 0.610966 |
| 1464 | ENTRANCES_AVG | APARTMENTS_AVG | 0.610442 |
| 3396 | ENTRANCES_MEDI | APARTMENTS_AVG | 0.606808 |
| 2796 | LIVINGAREA_MODE | FLOORSMAX_MODE | 0.605487 |
| 833 | EXT_SOURCE_1 | DAYS_BIRTH | 0.600143 |
I have chosen not to remove the highly correlated variables for now, although they remain candidates for removal in later steps. If a later stage uses an algorithm that requires collinearity to be removed, the most correlated variables will be dropped to avoid possible problems in the model's results.
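If that removal becomes necessary, one possible approach is a greedy filter that drops the later variable of every pair whose absolute correlation exceeds a threshold. This is a sketch under assumptions, not the method applied in this notebook: the function name `columns_to_drop`, the 0.95 threshold and the toy columns are all hypothetical.

```python
import numpy as np
import pandas as pd


def columns_to_drop(df: pd.DataFrame, threshold: float = 0.95,
                    method: str = "pearson") -> list[str]:
    """Return columns that correlate above the threshold with an earlier column."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so that each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (made up for illustration): x_medi is almost identical to x_avg.
toy = pd.DataFrame({
    "x_avg": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_medi": [1.1, 2.0, 3.1, 3.9, 5.0, 6.1],
    "z": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})

to_drop = columns_to_drop(toy, threshold=0.95)
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because the filter keeps whichever column comes first, the column order decides which member of a correlated group survives. In practice that choice might instead be based on missing-value rates or on the relationship with the target.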
In addition, checking the Spearman correlation can be useful, since it measures monotonic relationships, whether linear or not. This lets us identify patterns that Pearson might miss.
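The difference between the two coefficients can be seen on a small made-up example. Spearman is the Pearson correlation computed on the ranks of the values, so a relationship that is strictly increasing but strongly non-linear still gets a coefficient of 1. The data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): y grows exponentially with x, so the
# relationship is strictly increasing but far from a straight line.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

pearson = x.corr(y)                   # linear association on the raw values
spearman = x.rank().corr(y.rank())    # Pearson on the ranks, i.e. Spearman

print(round(pearson, 3))
print(round(spearman, 3))
```

Here Pearson falls clearly below 1 because the points do not lie on a straight line, while Spearman captures the perfect monotonic ordering.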
funciones.get_corr_matrix(dataset = pd_loan_train[list_var_continuous],
metodo='spearman', size_figure=[10,8])
0
# Spearman correlation between continuous variables
# ==============================================================================
corr = pd_loan_train[list_var_continuous].corr('spearman')
new_corr = corr.abs()
# Keep only the strict lower triangle (below the main diagonal) so each pair appears once
new_corr.loc[:,:] = np.tril(new_corr, k=-1)
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
# Pairs with an absolute correlation above 0.6
new_corr[new_corr['correlation']>0.6]
| | level_0 | level_1 | correlation |
|---|---|---|---|
| 3192 | YEARS_BUILD_MEDI | YEARS_BUILD_AVG | 0.998426 |
| 3122 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.997404 |
| 4198 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.997264 |
| 3612 | LANDAREA_MEDI | LANDAREA_AVG | 0.996263 |
| 3682 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.996227 |
| 3542 | FLOORSMIN_MEDI | FLOORSMIN_AVG | 0.996177 |
| 3262 | COMMONAREA_MEDI | COMMONAREA_AVG | 0.995847 |
| 3752 | LIVINGAREA_MEDI | LIVINGAREA_AVG | 0.995487 |
| 2982 | APARTMENTS_MEDI | APARTMENTS_AVG | 0.995211 |
| 3472 | FLOORSMAX_MEDI | FLOORSMAX_AVG | 0.994834 |
| 3052 | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | 0.994811 |
| 3402 | ENTRANCES_MEDI | ENTRANCES_AVG | 0.993370 |
| 3332 | ELEVATORS_MEDI | ELEVATORS_AVG | 0.991105 |
| 3206 | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | 0.988352 |
| 2226 | YEARS_BUILD_MODE | YEARS_BUILD_AVG | 0.988016 |
| 1259 | YEARS_BUILD_AVG | YEARS_BEGINEXPLUATATION_AVG | 0.986683 |
| 3486 | FLOORSMAX_MEDI | FLOORSMAX_MODE | 0.986527 |
| 3556 | FLOORSMIN_MEDI | FLOORSMIN_MODE | 0.986364 |
| 3219 | YEARS_BUILD_MEDI | YEARS_BEGINEXPLUATATION_MEDI | 0.985764 |
| 3136 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | 0.985673 |
| 3191 | YEARS_BUILD_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.985239 |
| 2156 | YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_AVG | 0.985236 |
| 208 | AMT_GOODS_PRICE | AMT_CREDIT | 0.984931 |
| 3123 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_AVG | 0.984369 |
| 2576 | FLOORSMIN_MODE | FLOORSMIN_AVG | 0.982634 |
| 3822 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_AVG | 0.982089 |
| 2506 | FLOORSMAX_MODE | FLOORSMAX_AVG | 0.981666 |
| 3892 | NONLIVINGAREA_MEDI | NONLIVINGAREA_AVG | 0.980895 |
| 2239 | YEARS_BUILD_MODE | YEARS_BEGINEXPLUATATION_MODE | 0.979828 |
| 3346 | ELEVATORS_MEDI | ELEVATORS_MODE | 0.977948 |
| 2225 | YEARS_BUILD_MODE | YEARS_BEGINEXPLUATATION_AVG | 0.975014 |
| 3137 | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MODE | 0.974845 |
| 3696 | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.973776 |
| 1740 | LIVINGAPARTMENTS_AVG | APARTMENTS_AVG | 0.972337 |
| 3276 | COMMONAREA_MEDI | COMMONAREA_MODE | 0.971266 |
| 3626 | LANDAREA_MEDI | LANDAREA_MODE | 0.970980 |
| 2996 | APARTMENTS_MEDI | APARTMENTS_MODE | 0.970013 |
| 2716 | LIVINGAPARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.969773 |
| 3700 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MEDI | 0.969457 |
| 3205 | YEARS_BUILD_MEDI | YEARS_BEGINEXPLUATATION_MODE | 0.969325 |
| 3766 | LIVINGAREA_MEDI | LIVINGAREA_MODE | 0.969151 |
| 2366 | ELEVATORS_MODE | ELEVATORS_AVG | 0.969089 |
| 2157 | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | 0.968941 |
| 3672 | LIVINGAPARTMENTS_MEDI | APARTMENTS_AVG | 0.968643 |
| 2646 | LANDAREA_MODE | LANDAREA_AVG | 0.967403 |
| 2992 | APARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | 0.966373 |
| 2296 | COMMONAREA_MODE | COMMONAREA_AVG | 0.966022 |
| 3416 | ENTRANCES_MEDI | ENTRANCES_MODE | 0.965282 |
| 2016 | APARTMENTS_MODE | APARTMENTS_AVG | 0.964732 |
| 2786 | LIVINGAREA_MODE | LIVINGAREA_AVG | 0.964095 |
| 3066 | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | 0.963150 |
| 2086 | BASEMENTAREA_MODE | BASEMENTAREA_AVG | 0.959563 |
| 2436 | ENTRANCES_MODE | ENTRANCES_AVG | 0.957920 |
| 2720 | LIVINGAPARTMENTS_MODE | APARTMENTS_MODE | 0.954911 |
| 3836 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAPARTMENTS_MODE | 0.951054 |
| 3006 | APARTMENTS_MEDI | LIVINGAPARTMENTS_MODE | 0.945932 |
| 2706 | LIVINGAPARTMENTS_MODE | APARTMENTS_AVG | 0.942804 |
| 3906 | NONLIVINGAREA_MEDI | NONLIVINGAREA_MODE | 0.941751 |
| 3959 | TOTALAREA_MODE | LIVINGAREA_AVG | 0.939730 |
| 3987 | TOTALAREA_MODE | LIVINGAREA_MEDI | 0.935073 |
| 3686 | LIVINGAPARTMENTS_MEDI | APARTMENTS_MODE | 0.932225 |
| 2856 | NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_AVG | 0.931730 |
| 2026 | APARTMENTS_MODE | LIVINGAPARTMENTS_AVG | 0.928328 |
| 2926 | NONLIVINGAREA_MODE | NONLIVINGAREA_AVG | 0.918241 |
| 3973 | TOTALAREA_MODE | LIVINGAREA_MODE | 0.917356 |
| 1809 | LIVINGAREA_AVG | APARTMENTS_AVG | 0.905494 |
| 3769 | LIVINGAREA_MEDI | APARTMENTS_MEDI | 0.903639 |
| 3741 | LIVINGAREA_MEDI | APARTMENTS_AVG | 0.901653 |
| 2993 | APARTMENTS_MEDI | LIVINGAREA_AVG | 0.900570 |
| 3948 | TOTALAREA_MODE | APARTMENTS_AVG | 0.899388 |
| 1819 | LIVINGAREA_AVG | LIVINGAPARTMENTS_AVG | 0.895433 |
| 3976 | TOTALAREA_MODE | APARTMENTS_MEDI | 0.894203 |
| 2789 | LIVINGAREA_MODE | APARTMENTS_MODE | 0.892478 |
| 3779 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MEDI | 0.891828 |
| 3683 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_AVG | 0.891543 |
| 3751 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.890115 |
| 3007 | APARTMENTS_MEDI | LIVINGAREA_MODE | 0.877175 |
| 3962 | TOTALAREA_MODE | APARTMENTS_MODE | 0.875080 |
| 3958 | TOTALAREA_MODE | LIVINGAPARTMENTS_AVG | 0.875076 |
| 2799 | LIVINGAREA_MODE | LIVINGAPARTMENTS_MODE | 0.874055 |
| 3755 | LIVINGAREA_MEDI | APARTMENTS_MODE | 0.873901 |
| 2775 | LIVINGAREA_MODE | APARTMENTS_AVG | 0.873060 |
| 3986 | TOTALAREA_MODE | LIVINGAPARTMENTS_MEDI | 0.871013 |
| 2027 | APARTMENTS_MODE | LIVINGAREA_AVG | 0.869356 |
| 3765 | LIVINGAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.867738 |
| 2717 | LIVINGAPARTMENTS_MODE | LIVINGAREA_AVG | 0.865253 |
| 3697 | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MODE | 0.857555 |
| 2785 | LIVINGAREA_MODE | LIVINGAPARTMENTS_AVG | 0.854246 |
| 3972 | TOTALAREA_MODE | LIVINGAPARTMENTS_MODE | 0.852418 |
| 1538 | FLOORSMAX_AVG | ELEVATORS_AVG | 0.849696 |
| 3498 | FLOORSMAX_MEDI | ELEVATORS_MEDI | 0.845868 |
| 4268 | DEF_60_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | 0.845469 |
| 3334 | ELEVATORS_MEDI | FLOORSMAX_AVG | 0.844714 |
| 3470 | FLOORSMAX_MEDI | ELEVATORS_AVG | 0.842213 |
| 2518 | FLOORSMAX_MODE | ELEVATORS_MODE | 0.833152 |
| 3484 | FLOORSMAX_MEDI | ELEVATORS_MODE | 0.830800 |
| 139 | AMT_ANNUITY | AMT_CREDIT | 0.829875 |
| 2368 | ELEVATORS_MODE | FLOORSMAX_AVG | 0.829679 |
| 209 | AMT_GOODS_PRICE | AMT_ANNUITY | 0.827734 |
| 3348 | ELEVATORS_MEDI | FLOORSMAX_MODE | 0.826220 |
| 2504 | FLOORSMAX_MODE | ELEVATORS_AVG | 0.822261 |
| 1816 | LIVINGAREA_AVG | FLOORSMAX_AVG | 0.783167 |
| 3476 | FLOORSMAX_MEDI | LIVINGAREA_AVG | 0.781890 |
| 3776 | LIVINGAREA_MEDI | FLOORSMAX_MEDI | 0.779805 |
| 3955 | TOTALAREA_MODE | FLOORSMAX_AVG | 0.778176 |
| 3748 | LIVINGAREA_MEDI | FLOORSMAX_AVG | 0.777797 |
| 3983 | TOTALAREA_MODE | FLOORSMAX_MEDI | 0.777465 |
| 2510 | FLOORSMAX_MODE | LIVINGAREA_AVG | 0.776320 |
| 3969 | TOTALAREA_MODE | FLOORSMAX_MODE | 0.775992 |
| 3762 | LIVINGAREA_MEDI | FLOORSMAX_MODE | 0.774854 |
| 2796 | LIVINGAREA_MODE | FLOORSMAX_MODE | 0.760315 |
| 1533 | FLOORSMAX_AVG | APARTMENTS_AVG | 0.756388 |
| 3465 | FLOORSMAX_MEDI | APARTMENTS_AVG | 0.755318 |
| 3493 | FLOORSMAX_MEDI | APARTMENTS_MEDI | 0.752298 |
| 2499 | FLOORSMAX_MODE | APARTMENTS_AVG | 0.751424 |
| 2989 | APARTMENTS_MEDI | FLOORSMAX_AVG | 0.750350 |
| 3003 | APARTMENTS_MEDI | FLOORSMAX_MODE | 0.748986 |
| 3490 | FLOORSMAX_MEDI | LIVINGAREA_MODE | 0.747335 |
| 2782 | LIVINGAREA_MODE | FLOORSMAX_AVG | 0.744234 |
| 2513 | FLOORSMAX_MODE | APARTMENTS_MODE | 0.734892 |
| 1814 | LIVINGAREA_AVG | ELEVATORS_AVG | 0.727959 |
| 1747 | LIVINGAPARTMENTS_AVG | FLOORSMAX_AVG | 0.725171 |
| 3338 | ELEVATORS_MEDI | LIVINGAREA_AVG | 0.723597 |
| 3475 | FLOORSMAX_MEDI | LIVINGAPARTMENTS_AVG | 0.722890 |
| 1810 | LIVINGAREA_AVG | BASEMENTAREA_AVG | 0.720954 |
| 3707 | LIVINGAPARTMENTS_MEDI | FLOORSMAX_MEDI | 0.720549 |
| 3774 | LIVINGAREA_MEDI | ELEVATORS_MEDI | 0.720503 |
| 3679 | LIVINGAPARTMENTS_MEDI | FLOORSMAX_AVG | 0.720240 |
| 3479 | FLOORSMAX_MEDI | APARTMENTS_MODE | 0.719930 |
| 3746 | LIVINGAREA_MEDI | ELEVATORS_AVG | 0.718614 |
| 3742 | LIVINGAREA_MEDI | BASEMENTAREA_AVG | 0.717925 |
| 3953 | TOTALAREA_MODE | ELEVATORS_AVG | 0.717669 |
| 2509 | FLOORSMAX_MODE | LIVINGAPARTMENTS_AVG | 0.717595 |
| 2023 | APARTMENTS_MODE | FLOORSMAX_AVG | 0.717143 |
| 3770 | LIVINGAREA_MEDI | BASEMENTAREA_MEDI | 0.716459 |
| 3693 | LIVINGAPARTMENTS_MEDI | FLOORSMAX_MODE | 0.715837 |
| 3981 | TOTALAREA_MODE | ELEVATORS_MEDI | 0.714696 |
| 3062 | BASEMENTAREA_MEDI | LIVINGAREA_AVG | 0.714671 |
| 3949 | TOTALAREA_MODE | BASEMENTAREA_AVG | 0.713383 |
| 2372 | ELEVATORS_MODE | LIVINGAREA_AVG | 0.709315 |
| 3760 | LIVINGAREA_MEDI | ELEVATORS_MODE | 0.707961 |
| 3977 | TOTALAREA_MODE | BASEMENTAREA_MEDI | 0.706739 |
| 3967 | TOTALAREA_MODE | ELEVATORS_MODE | 0.706148 |
| 2790 | LIVINGAREA_MODE | BASEMENTAREA_MODE | 0.704663 |
| 2727 | LIVINGAPARTMENTS_MODE | FLOORSMAX_MODE | 0.701604 |
| 1119 | BASEMENTAREA_AVG | APARTMENTS_AVG | 0.701137 |
| 3076 | BASEMENTAREA_MEDI | LIVINGAREA_MODE | 0.699656 |
| 2776 | LIVINGAREA_MODE | BASEMENTAREA_AVG | 0.698836 |
| 2939 | NONLIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | 0.698836 |
| 2983 | APARTMENTS_MEDI | BASEMENTAREA_AVG | 0.698324 |
| 3079 | BASEMENTAREA_MEDI | APARTMENTS_MEDI | 0.697480 |
| 1395 | ELEVATORS_AVG | APARTMENTS_AVG | 0.696550 |
| 3051 | BASEMENTAREA_MEDI | APARTMENTS_AVG | 0.695305 |
| 1741 | LIVINGAPARTMENTS_AVG | BASEMENTAREA_AVG | 0.694236 |
| 3327 | ELEVATORS_MEDI | APARTMENTS_AVG | 0.693289 |
| 3673 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_AVG | 0.692509 |
| 3489 | FLOORSMAX_MEDI | LIVINGAPARTMENTS_MODE | 0.692438 |
| 1745 | LIVINGAPARTMENTS_AVG | ELEVATORS_AVG | 0.691953 |
| 3701 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MEDI | 0.690324 |
| 2713 | LIVINGAPARTMENTS_MODE | FLOORSMAX_AVG | 0.690189 |
| 3355 | ELEVATORS_MEDI | APARTMENTS_MEDI | 0.689940 |
| 3919 | NONLIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | 0.688920 |
| 3337 | ELEVATORS_MEDI | LIVINGAPARTMENTS_AVG | 0.687962 |
| 3061 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_AVG | 0.687924 |
| 2794 | LIVINGAREA_MODE | ELEVATORS_MODE | 0.687518 |
| 2987 | APARTMENTS_MEDI | ELEVATORS_AVG | 0.687328 |
| 3705 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MEDI | 0.686274 |
| 1959 | NONLIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | 0.685697 |
| 2099 | BASEMENTAREA_MODE | APARTMENTS_MODE | 0.685352 |
| 3677 | LIVINGAPARTMENTS_MEDI | ELEVATORS_AVG | 0.685043 |
| 2361 | ELEVATORS_MODE | APARTMENTS_AVG | 0.681356 |
| 3065 | BASEMENTAREA_MEDI | APARTMENTS_MODE | 0.679821 |
| 3001 | APARTMENTS_MEDI | ELEVATORS_MODE | 0.679675 |
| 2707 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_AVG | 0.679040 |
| 2017 | APARTMENTS_MODE | BASEMENTAREA_AVG | 0.678830 |
| 3075 | BASEMENTAREA_MEDI | LIVINGAPARTMENTS_MODE | 0.678804 |
| 3756 | LIVINGAREA_MEDI | BASEMENTAREA_MODE | 0.677755 |
| 3891 | NONLIVINGAREA_MEDI | NONLIVINGAPARTMENTS_AVG | 0.677663 |
| 2371 | ELEVATORS_MODE | LIVINGAPARTMENTS_AVG | 0.675010 |
| 2721 | LIVINGAPARTMENTS_MODE | BASEMENTAREA_MODE | 0.674730 |
| 3691 | LIVINGAPARTMENTS_MEDI | ELEVATORS_MODE | 0.674205 |
| 3352 | ELEVATORS_MEDI | LIVINGAREA_MODE | 0.674027 |
| 3963 | TOTALAREA_MODE | BASEMENTAREA_MODE | 0.673629 |
| 2096 | BASEMENTAREA_MODE | LIVINGAREA_AVG | 0.672355 |
| 3823 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_AVG | 0.672205 |
| 2780 | LIVINGAREA_MODE | ELEVATORS_AVG | 0.667562 |
| 1609 | FLOORSMIN_AVG | FLOORSMAX_AVG | 0.664265 |
| 2997 | APARTMENTS_MEDI | BASEMENTAREA_MODE | 0.661743 |
| 2375 | ELEVATORS_MODE | APARTMENTS_MODE | 0.661641 |
| 1465 | ENTRANCES_AVG | BASEMENTAREA_AVG | 0.661493 |
| 3057 | BASEMENTAREA_MEDI | ENTRANCES_AVG | 0.661325 |
| 3837 | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MODE | 0.661223 |
| 3473 | FLOORSMAX_MEDI | FLOORSMIN_AVG | 0.661016 |
| 3569 | FLOORSMIN_MEDI | FLOORSMAX_MEDI | 0.660512 |
| 3425 | ENTRANCES_MEDI | BASEMENTAREA_MEDI | 0.659872 |
| 2725 | LIVINGAPARTMENTS_MODE | ELEVATORS_MODE | 0.659552 |
| 3541 | FLOORSMIN_MEDI | FLOORSMAX_AVG | 0.658525 |
| 3905 | NONLIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MODE | 0.657113 |
| 2085 | BASEMENTAREA_MODE | APARTMENTS_AVG | 0.655919 |
| 3397 | ENTRANCES_MEDI | BASEMENTAREA_AVG | 0.654026 |
| 2507 | FLOORSMAX_MODE | FLOORSMIN_AVG | 0.650605 |
| 3555 | FLOORSMIN_MEDI | FLOORSMAX_MODE | 0.650393 |
| 3351 | ELEVATORS_MEDI | LIVINGAPARTMENTS_MODE | 0.649455 |
| 3687 | LIVINGAPARTMENTS_MEDI | BASEMENTAREA_MODE | 0.648890 |
| 2445 | ENTRANCES_MODE | BASEMENTAREA_MODE | 0.647818 |
| 2711 | LIVINGAPARTMENTS_MODE | ELEVATORS_AVG | 0.644841 |
| 3341 | ELEVATORS_MEDI | APARTMENTS_MODE | 0.644342 |
| 2095 | BASEMENTAREA_MODE | LIVINGAPARTMENTS_AVG | 0.644305 |
| 2925 | NONLIVINGAREA_MODE | NONLIVINGAPARTMENTS_AVG | 0.644192 |
| 3411 | ENTRANCES_MEDI | BASEMENTAREA_MODE | 0.643684 |
| 2589 | FLOORSMIN_MODE | FLOORSMAX_MODE | 0.643038 |
| 2091 | BASEMENTAREA_MODE | ENTRANCES_AVG | 0.643034 |
| 2021 | APARTMENTS_MODE | ELEVATORS_AVG | 0.638970 |
| 3487 | FLOORSMAX_MEDI | FLOORSMIN_MODE | 0.638556 |
| 2575 | FLOORSMIN_MODE | FLOORSMAX_AVG | 0.635992 |
| 2857 | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_AVG | 0.635282 |
| 3071 | BASEMENTAREA_MEDI | ENTRANCES_MODE | 0.623683 |
| 3952 | TOTALAREA_MODE | COMMONAREA_AVG | 0.623165 |
| 3980 | TOTALAREA_MODE | COMMONAREA_MEDI | 0.618917 |
| 2431 | ENTRANCES_MODE | BASEMENTAREA_AVG | 0.616569 |
| 1815 | LIVINGAREA_AVG | ENTRANCES_AVG | 0.609925 |
| 3747 | LIVINGAREA_MEDI | ENTRANCES_AVG | 0.609351 |
| 3775 | LIVINGAREA_MEDI | ENTRANCES_MEDI | 0.606394 |
| 2781 | LIVINGAREA_MODE | ENTRANCES_AVG | 0.603576 |
| 3421 | ENTRANCES_MEDI | LIVINGAREA_MODE | 0.602933 |
| 3407 | ENTRANCES_MEDI | LIVINGAREA_AVG | 0.601584 |
| 3966 | TOTALAREA_MODE | COMMONAREA_MODE | 0.600723 |
The Spearman correlation matrix captures monotonic relationships, whether linear or non-linear, and has the advantage of being less sensitive to outliers. The Pearson correlation matrix, by contrast, focuses on linear relationships. Both matrices provide useful information about the dependencies between variables.
Comparing the two matrices, the relationships between variables are quite similar under Pearson and Spearman, which suggests that most of the relationships are monotonic and close to linear. However, the Spearman matrix reveals some relationships that did not appear under Pearson, which indicates that the data contain non-linear patterns that linear correlation does not capture.
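To make the difference between the two coefficients concrete, here is a minimal sketch on synthetic data (not the loan dataset): `y` increases monotonically but non-linearly with `x`, so Spearman should be at or near 1 while Pearson stays lower. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not taken from the notebook):
# y is a strictly increasing, strongly non-linear function of x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=1_000)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman works on ranks.
pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

Pairs where Spearman is clearly higher than Pearson are candidates for this kind of monotonic, non-linear relationship.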
Before deciding how to handle null values, it is essential to analyze their distribution with respect to the target variable. Specifically, it is useful to determine whether missing values are concentrated in one class of the target or spread uniformly.
# Continuous variables
# ==============================================================================
list_var_continuous
['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
funciones.get_percent_null_values_target(pd_loan_train, list_var_continuous, target='TARGET')
| | 0 | 1 | variable | sum_null_values | porcentaje_sum_null_values |
|---|---|---|---|---|---|
| 0 | 1.000000 | 0.000000 | AMT_ANNUITY | 10 | 0.000041 |
| 1 | 0.921397 | 0.078603 | AMT_GOODS_PRICE | 229 | 0.000931 |
| 2 | 0.914958 | 0.085042 | OWN_CAR_AGE | 162379 | 0.660056 |
| 3 | 1.000000 | 0.000000 | CNT_FAM_MEMBERS | 1 | 0.000004 |
| 4 | 0.914898 | 0.085102 | EXT_SOURCE_1 | 138763 | 0.564059 |
| 5 | 0.921305 | 0.078695 | EXT_SOURCE_2 | 521 | 0.002118 |
| 6 | 0.907037 | 0.092963 | EXT_SOURCE_3 | 48826 | 0.198473 |
| 7 | 0.908635 | 0.091365 | APARTMENTS_AVG | 124774 | 0.507195 |
| 8 | 0.911045 | 0.088955 | BASEMENTAREA_AVG | 143904 | 0.584957 |
| 9 | 0.908013 | 0.091987 | YEARS_BEGINEXPLUATATION_AVG | 119930 | 0.487504 |
| 10 | 0.913224 | 0.086776 | YEARS_BUILD_AVG | 163466 | 0.664474 |
| 11 | 0.914296 | 0.085704 | COMMONAREA_AVG | 171765 | 0.698209 |
| 12 | 0.909062 | 0.090938 | ELEVATORS_AVG | 130969 | 0.532377 |
| 13 | 0.908334 | 0.091666 | ENTRANCES_AVG | 123764 | 0.503089 |
| 14 | 0.908227 | 0.091773 | FLOORSMAX_AVG | 122356 | 0.497366 |
| 15 | 0.913764 | 0.086236 | FLOORSMIN_AVG | 166833 | 0.678161 |
| 16 | 0.911874 | 0.088126 | LANDAREA_AVG | 145984 | 0.593412 |
| 17 | 0.913928 | 0.086072 | LIVINGAPARTMENTS_AVG | 168103 | 0.683323 |
| 18 | 0.908724 | 0.091276 | LIVINGAREA_AVG | 123428 | 0.501724 |
| 19 | 0.914332 | 0.085668 | NONLIVINGAPARTMENTS_AVG | 170752 | 0.694091 |
| 20 | 0.909811 | 0.090189 | NONLIVINGAREA_AVG | 135626 | 0.551307 |
| 21 | 0.908635 | 0.091365 | APARTMENTS_MODE | 124774 | 0.507195 |
| 22 | 0.911045 | 0.088955 | BASEMENTAREA_MODE | 143904 | 0.584957 |
| 23 | 0.908013 | 0.091987 | YEARS_BEGINEXPLUATATION_MODE | 119930 | 0.487504 |
| 24 | 0.913224 | 0.086776 | YEARS_BUILD_MODE | 163466 | 0.664474 |
| 25 | 0.914296 | 0.085704 | COMMONAREA_MODE | 171765 | 0.698209 |
| 26 | 0.909062 | 0.090938 | ELEVATORS_MODE | 130969 | 0.532377 |
| 27 | 0.908334 | 0.091666 | ENTRANCES_MODE | 123764 | 0.503089 |
| 28 | 0.908227 | 0.091773 | FLOORSMAX_MODE | 122356 | 0.497366 |
| 29 | 0.913764 | 0.086236 | FLOORSMIN_MODE | 166833 | 0.678161 |
| 30 | 0.911874 | 0.088126 | LANDAREA_MODE | 145984 | 0.593412 |
| 31 | 0.913928 | 0.086072 | LIVINGAPARTMENTS_MODE | 168103 | 0.683323 |
| 32 | 0.908724 | 0.091276 | LIVINGAREA_MODE | 123428 | 0.501724 |
| 33 | 0.914332 | 0.085668 | NONLIVINGAPARTMENTS_MODE | 170752 | 0.694091 |
| 34 | 0.909811 | 0.090189 | NONLIVINGAREA_MODE | 135626 | 0.551307 |
| 35 | 0.908635 | 0.091365 | APARTMENTS_MEDI | 124774 | 0.507195 |
| 36 | 0.911045 | 0.088955 | BASEMENTAREA_MEDI | 143904 | 0.584957 |
| 37 | 0.908013 | 0.091987 | YEARS_BEGINEXPLUATATION_MEDI | 119930 | 0.487504 |
| 38 | 0.913224 | 0.086776 | YEARS_BUILD_MEDI | 163466 | 0.664474 |
| 39 | 0.914296 | 0.085704 | COMMONAREA_MEDI | 171765 | 0.698209 |
| 40 | 0.909062 | 0.090938 | ELEVATORS_MEDI | 130969 | 0.532377 |
| 41 | 0.908334 | 0.091666 | ENTRANCES_MEDI | 123764 | 0.503089 |
| 42 | 0.908227 | 0.091773 | FLOORSMAX_MEDI | 122356 | 0.497366 |
| 43 | 0.913764 | 0.086236 | FLOORSMIN_MEDI | 166833 | 0.678161 |
| 44 | 0.911874 | 0.088126 | LANDAREA_MEDI | 145984 | 0.593412 |
| 45 | 0.913928 | 0.086072 | LIVINGAPARTMENTS_MEDI | 168103 | 0.683323 |
| 46 | 0.908724 | 0.091276 | LIVINGAREA_MEDI | 123428 | 0.501724 |
| 47 | 0.914332 | 0.085668 | NONLIVINGAPARTMENTS_MEDI | 170752 | 0.694091 |
| 48 | 0.909811 | 0.090189 | NONLIVINGAREA_MEDI | 135626 | 0.551307 |
| 49 | 0.907681 | 0.092319 | TOTALAREA_MODE | 118675 | 0.482403 |
| 50 | 0.961538 | 0.038462 | OBS_30_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| 51 | 0.961538 | 0.038462 | DEF_30_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| 52 | 0.961538 | 0.038462 | OBS_60_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| 53 | 0.961538 | 0.038462 | DEF_60_CNT_SOCIAL_CIRCLE | 832 | 0.003382 |
| 54 | 1.000000 | 0.000000 | DAYS_LAST_PHONE_CHANGE | 1 | 0.000004 |
| 55 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_HOUR | 33199 | 0.134951 |
| 56 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_DAY | 33199 | 0.134951 |
| 57 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_WEEK | 33199 | 0.134951 |
| 58 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_MON | 33199 | 0.134951 |
| 59 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_QRT | 33199 | 0.134951 |
| 60 | 0.896623 | 0.103377 | AMT_REQ_CREDIT_BUREAU_YEAR | 33199 | 0.134951 |
As noted in the previous notebook, the sample is imbalanced: most clients have no difficulty repaying the loan. Missing values are likewise found mainly in class 0 of the target (clients without payment problems). For most variables the rows with missing values split roughly 91% / 9% between classes 0 and 1, which mirrors the imbalance present in the data rather than pointing to a class-specific pattern.
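One way to read a table like the one above is to compare the positive-class rate among rows where a column is missing with the overall rate. The sketch below does this on a toy frame; `null_target_rate` is a hypothetical helper written for illustration and is not the notebook's `funciones.get_percent_null_values_target`, whose source is not shown here.

```python
import numpy as np
import pandas as pd

def null_target_rate(df, columns, target="TARGET"):
    """For each column that has nulls, report the positive-class rate among
    the rows where it is missing, next to the overall positive-class rate."""
    base_rate = df[target].mean()
    rows = []
    for col in columns:
        mask = df[col].isna()
        if mask.any():
            rows.append({
                "variable": col,
                "n_null": int(mask.sum()),
                "target_rate_null": df.loc[mask, target].mean(),
                "target_rate_overall": base_rate,
            })
    return pd.DataFrame(rows)

# Toy frame standing in for pd_loan_train (values are made up).
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 0, 1, 0, 0],
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
summary = null_target_rate(toy, ["a", "b"])
print(summary)
```

A large gap between the two rates for a given column would suggest that missingness itself carries information about the target.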
Since I do not have enough context about the variables at the outset, several approaches can be tried and the resulting model performance compared. The options are:
Option 0: Some algorithms can handle missing values directly, without imputation.
Option 1: Drop rows with null values. This is not ideal here because, as shown above, a significant number of rows contain missing data.
Option 2: Impute missing values with a statistic such as the mean, median, maximum, minimum, or even extreme values.
Option 3: Fill missing values using predictive models such as KNN, linear regression, or XGBoost. This approach can be computationally expensive and requires care to avoid overfitting.
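For reference, option 3 could look roughly like the following with scikit-learn's `KNNImputer`, which is imported at the top of this notebook. The toy frames and `n_neighbors=2` are assumptions chosen for illustration; the key point is that the imputer is fitted on train only and then applied to test.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy train/test frames (made-up values, not the loan data).
train = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, np.nan],
    "x2": [10.0, 20.0, np.nan, 40.0, 50.0],
})
test = pd.DataFrame({"x1": [2.5, np.nan], "x2": [np.nan, 30.0]})

# Fit on train only; each missing value is filled with the mean of that
# feature over the nearest train rows (distance computed on observed features).
imputer = KNNImputer(n_neighbors=2)
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
print(test_imputed)
```

On a dataset with hundreds of thousands of rows the neighbor search becomes expensive, which is the cost mentioned in option 3.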
In this work I chose option 2 and impute missing values with the mean or median (the code below uses the median). This is a simple way to complete the data without resorting to costly methods or distorting its central tendency. I rule out extreme values because some columns represent time differences, with both negative and positive values. In addition, using extremes could introduce bias and affect the consistency of the analysis.
Imputing with the mean or median therefore keeps the columns consistent and balanced, so that the replaced values are representative of the general tendency of the data without introducing undue bias.
It is important to remember that missing values in the test set are imputed using the median computed from the train set.
# Medians are computed on train only and reused for test to avoid leakage
train_medians = pd_loan_train[list_var_continuous].median()
pd_loan_train[list_var_continuous] = pd_loan_train[list_var_continuous].fillna(train_medians)
pd_loan_test[list_var_continuous] = pd_loan_test[list_var_continuous].fillna(train_medians)
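The same rule (learn the statistic on train, apply it to both sets) can be expressed with scikit-learn's `SimpleImputer`. This is a sketch on toy frames with made-up values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames (made-up values).
train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 100.0], "b": [np.nan, 5.0, 7.0, 9.0]})
test = pd.DataFrame({"a": [np.nan, 3.0], "b": [4.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # medians are learned from train only

train_filled = pd.DataFrame(
    imputer.transform(train), columns=train.columns, index=train.index
)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

# Per-column medians learned from train, in column order.
print(imputer.statistics_)
print(test_filled)
```

Keeping the fitted imputer object around also makes it easy to apply exactly the same fill values to any later batch of data.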
pd_loan_train[list_var_continuous]
| SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | CNT_FAM_MEMBERS | HOUR_APPR_PROCESS_START | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 253977 | 405000.0 | 904500.0 | 38322.0 | 904500.0 | 0.015221 | -17192.0 | -5385.0 | -125.0 | -738.0 | 2.0 | 2.0 | 9.0 | 0.650515 | 0.498739 | 0.411849 | 0.0371 | 0.0251 | 0.9985 | 0.7552 | 0.0212 | 0.08 | 0.0345 | 0.5000 | 0.2083 | 0.0482 | 0.0756 | 0.04430 | 0.0000 | 0.0151 | 0.0378 | 0.0260 | 0.9985 | 0.7648 | 0.0191 | 0.0806 | 0.0345 | 0.5000 | 0.2083 | 0.0458 | 0.0771 | 0.0462 | 0.0000 | 0.0160 | 0.0375 | 0.0251 | 0.9985 | 0.7585 | 0.0209 | 0.08 | 0.0345 | 0.5000 | 0.2083 | 0.0487 | 0.0770 | 0.0451 | 0.0000 | 0.0155 | 0.0585 | 3.0 | 1.0 | 3.0 | 0.0 | -492.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 6.0 |
| 387015 | 202500.0 | 227520.0 | 8707.5 | 180000.0 | 0.007120 | -19885.0 | -627.0 | -7730.0 | -3437.0 | 9.0 | 2.0 | 7.0 | 0.506005 | 0.645398 | 0.385915 | 0.1856 | 0.0764 | 0.9886 | 0.7552 | 0.0212 | 0.00 | 0.4138 | 0.1667 | 0.2083 | 0.0482 | 0.0756 | 0.17040 | 0.0000 | 0.0036 | 0.1891 | 0.0747 | 0.9886 | 0.7648 | 0.0191 | 0.0000 | 0.4138 | 0.1667 | 0.2083 | 0.0458 | 0.0771 | 0.1776 | 0.0000 | 0.0011 | 0.1874 | 0.0759 | 0.9886 | 0.7585 | 0.0209 | 0.00 | 0.4138 | 0.1667 | 0.2083 | 0.0487 | 0.0770 | 0.1735 | 0.0000 | 0.0031 | 0.1341 | 7.0 | 0.0 | 7.0 | 0.0 | -1525.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| 184784 | 225000.0 | 462825.0 | 33808.5 | 378000.0 | 0.072508 | -17351.0 | -9975.0 | -8836.0 | -886.0 | 5.0 | 2.0 | 14.0 | 0.742348 | 0.716670 | 0.265049 | 0.0629 | 0.0107 | 0.9727 | 0.6260 | 0.1092 | 0.00 | 0.1034 | 0.1667 | 0.2083 | 0.0109 | 0.0504 | 0.05060 | 0.0039 | 0.0051 | 0.0641 | 0.0110 | 0.9727 | 0.6406 | 0.1079 | 0.0000 | 0.1034 | 0.1667 | 0.2083 | 0.0111 | 0.0551 | 0.0526 | 0.0039 | 0.0054 | 0.0635 | 0.0107 | 0.9727 | 0.6310 | 0.1099 | 0.00 | 0.1034 | 0.1667 | 0.2083 | 0.0111 | 0.0513 | 0.0515 | 0.0039 | 0.0052 | 0.0408 | 0.0 | 0.0 | 0.0 | 0.0 | -2978.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 284885 | 135000.0 | 592560.0 | 28638.0 | 450000.0 | 0.032561 | -15563.0 | -2836.0 | -7890.0 | -4519.0 | 9.0 | 2.0 | 13.0 | 0.506005 | 0.800211 | 0.537070 | 0.2495 | 0.1847 | 0.9722 | 0.7552 | 0.0212 | 0.36 | 0.3103 | 0.3750 | 0.2083 | 0.0482 | 0.0756 | 0.35190 | 0.0000 | 0.2126 | 0.2542 | 0.1917 | 0.9722 | 0.7648 | 0.0191 | 0.3625 | 0.3103 | 0.3750 | 0.2083 | 0.0458 | 0.0771 | 0.3657 | 0.0000 | 0.1707 | 0.2519 | 0.1847 | 0.9722 | 0.7585 | 0.0209 | 0.36 | 0.3103 | 0.3750 | 0.2083 | 0.0487 | 0.0770 | 0.3582 | 0.0000 | 0.2170 | 0.3125 | 1.0 | 0.0 | 1.0 | 0.0 | -1245.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 146800 | 225000.0 | 927252.0 | 25627.5 | 774000.0 | 0.016612 | -15103.0 | -1629.0 | -3260.0 | -3846.0 | 1.0 | 3.0 | 18.0 | 0.506005 | 0.221765 | 0.260856 | 0.0876 | 0.0764 | 0.9816 | 0.7552 | 0.0212 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0482 | 0.0756 | 0.07465 | 0.0000 | 0.0036 | 0.0840 | 0.0747 | 0.9816 | 0.7648 | 0.0191 | 0.0000 | 0.1379 | 0.1667 | 0.2083 | 0.0458 | 0.0771 | 0.0731 | 0.0000 | 0.0011 | 0.0874 | 0.0759 | 0.9816 | 0.7585 | 0.0209 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0487 | 0.0770 | 0.0750 | 0.0000 | 0.0031 | 0.0688 | 6.0 | 0.0 | 6.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 141125 | 148500.0 | 814041.0 | 23931.0 | 679500.0 | 0.025164 | -16546.0 | -5329.0 | -4705.0 | -93.0 | 7.0 | 1.0 | 8.0 | 0.506005 | 0.643063 | 0.406617 | 0.0876 | 0.0764 | 0.9816 | 0.7552 | 0.0212 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0482 | 0.0756 | 0.07465 | 0.0000 | 0.0036 | 0.0840 | 0.0747 | 0.9816 | 0.7648 | 0.0191 | 0.0000 | 0.1379 | 0.1667 | 0.2083 | 0.0458 | 0.0771 | 0.0731 | 0.0000 | 0.0011 | 0.0874 | 0.0759 | 0.9816 | 0.7585 | 0.0209 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0487 | 0.0770 | 0.0750 | 0.0000 | 0.0031 | 0.0688 | 0.0 | 0.0 | 0.0 | 0.0 | -2405.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 425798 | 135000.0 | 1620000.0 | 56308.5 | 1620000.0 | 0.046220 | -16370.0 | -935.0 | -6868.0 | -3648.0 | 9.0 | 3.0 | 14.0 | 0.719393 | 0.688328 | 0.465069 | 0.1320 | 0.1491 | 0.9861 | 0.8096 | 0.0052 | 0.16 | 0.1379 | 0.3333 | 0.3750 | 0.0962 | 0.1076 | 0.08260 | 0.0000 | 0.1752 | 0.1345 | 0.1547 | 0.9861 | 0.8171 | 0.0052 | 0.1611 | 0.1379 | 0.3333 | 0.3750 | 0.0984 | 0.1175 | 0.0861 | 0.0000 | 0.1855 | 0.1332 | 0.1491 | 0.9861 | 0.8121 | 0.0052 | 0.16 | 0.1379 | 0.3333 | 0.3750 | 0.0979 | 0.1095 | 0.0841 | 0.0000 | 0.1788 | 0.1059 | 0.0 | 0.0 | 0.0 | 0.0 | -2593.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 401809 | 135000.0 | 495000.0 | 25402.5 | 495000.0 | 0.028663 | -8834.0 | -957.0 | -3380.0 | -1508.0 | 9.0 | 1.0 | 15.0 | 0.274011 | 0.515539 | 0.537070 | 0.0876 | 0.0764 | 0.9816 | 0.7552 | 0.0212 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0482 | 0.0756 | 0.07465 | 0.0000 | 0.0036 | 0.0840 | 0.0747 | 0.9816 | 0.7648 | 0.0191 | 0.0000 | 0.1379 | 0.1667 | 0.2083 | 0.0458 | 0.0771 | 0.0731 | 0.0000 | 0.0011 | 0.0874 | 0.0759 | 0.9816 | 0.7585 | 0.0209 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0487 | 0.0770 | 0.0750 | 0.0000 | 0.0031 | 0.0688 | 1.0 | 1.0 | 1.0 | 1.0 | -488.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 325030 | 112500.0 | 679500.0 | 19998.0 | 679500.0 | 0.019689 | -18439.0 | -2920.0 | -8460.0 | -1900.0 | 9.0 | 2.0 | 11.0 | 0.862565 | 0.172498 | 0.537070 | 0.0227 | 0.0764 | 0.9771 | 0.7552 | 0.0212 | 0.00 | 0.1034 | 0.0417 | 0.2083 | 0.0000 | 0.0756 | 0.01760 | 0.0000 | 0.0061 | 0.0231 | 0.0747 | 0.9772 | 0.7648 | 0.0191 | 0.0000 | 0.1034 | 0.0417 | 0.2083 | 0.0000 | 0.0771 | 0.0184 | 0.0000 | 0.0064 | 0.0229 | 0.0759 | 0.9771 | 0.7585 | 0.0209 | 0.00 | 0.1034 | 0.0417 | 0.2083 | 0.0000 | 0.0770 | 0.0180 | 0.0000 | 0.0062 | 0.0152 | 0.0 | 0.0 | 0.0 | 0.0 | -2395.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 221487 | 144000.0 | 521280.0 | 23089.5 | 450000.0 | 0.030755 | -23264.0 | 365243.0 | -467.0 | -3124.0 | 12.0 | 2.0 | 16.0 | 0.506005 | 0.700755 | 0.759712 | 0.0515 | 0.0652 | 0.9896 | 0.8572 | 0.0090 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0532 | 0.0420 | 0.05070 | 0.0000 | 0.0000 | 0.0525 | 0.0677 | 0.9896 | 0.8628 | 0.0090 | 0.0000 | 0.1379 | 0.1667 | 0.2083 | 0.0544 | 0.0459 | 0.0529 | 0.0000 | 0.0000 | 0.0520 | 0.0652 | 0.9896 | 0.8591 | 0.0090 | 0.00 | 0.1379 | 0.1667 | 0.2083 | 0.0541 | 0.0428 | 0.0517 | 0.0000 | 0.0000 | 0.0448 | 5.0 | 0.0 | 5.0 | 0.0 | -2401.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
246008 rows × 69 columns
# Shape of the training dataset
# ==============================================================================
pd_loan_train.shape
(246008, 121)
We now verify that no variables contain null values, confirming that the replacement succeeded.
funciones.get_percent_null_values_target(pd_loan_train, list_var_continuous, target='TARGET')
No existen variables con valores nulos
list_var_continuous = list(pd_loan_train.select_dtypes('float').columns)
funciones.get_corr_matrix(dataset=pd_loan_train[list_var_continuous],
                          metodo='pearson', size_figure=[10, 8])
0
With missing values imputed by the median, the Pearson correlation matrix shows no significant differences from the one computed on the original data. This suggests that the imputation has not materially altered the relationships between the variables in the dataset.
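A visual comparison of two heatmaps can be backed by a numeric check: compute both correlation matrices and look at the largest absolute change. The sketch below uses synthetic data with about half of one column missing (an assumption for illustration); in that setting median imputation visibly shrinks the coefficient, which is why the check is worth running on columns with many nulls.

```python
import numpy as np
import pandas as pd

# Synthetic, correlated pair of columns (not the loan data).
rng = np.random.default_rng(42)
n = 2_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)
raw = pd.DataFrame({"x": x, "y": y})

# Knock out roughly half of y at random.
raw.loc[rng.random(n) < 0.5, "y"] = np.nan

# Median imputation, as in the notebook.
imputed = raw.fillna(raw.median())

# pandas computes correlations on pairwise-complete observations.
corr_before = raw.corr(method="pearson")
corr_after = imputed.corr(method="pearson")
max_abs_diff = (corr_after - corr_before).abs().to_numpy().max()

print(corr_before.loc["x", "y"], corr_after.loc["x", "y"], max_abs_diff)
```

Applied to the real frames, a small `max_abs_diff` would support the conclusion above in a reproducible way.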
The treatment of categorical variables includes evaluating their association with the target and their relevance to the model. Cramér's V measures the strength of association between two categorical variables; it ranges from 0 to 1, with values close to 1 indicating a strong relationship. This analysis helps select relevant variables and eliminate redundancies.
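The source of `funciones.cramers_v` is not shown in this notebook, so the following is only a sketch of one common formulation: the bias-corrected Cramér's V computed from the chi-squared statistic of a contingency table. The function name `cramers_v_corrected` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a 2-D contingency table of counts."""
    table = np.asarray(table)
    # correction=False disables the Yates continuity correction on 2x2 tables.
    chi2 = ss.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    r, k = table.shape
    phi2 = chi2 / n
    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1)))

# Toy data: one perfectly associated variable and one independent variable.
toy = pd.DataFrame({
    "target": [0] * 50 + [1] * 50,
    "same": ["a"] * 50 + ["b"] * 50,
    "noise": (["x"] * 25 + ["y"] * 25) * 2,
})
v_same = cramers_v_corrected(pd.crosstab(toy["target"], toy["same"]).values)
v_noise = cramers_v_corrected(pd.crosstab(toy["target"], toy["noise"]).values)
print(v_same, v_noise)
```

If a continuity correction is left enabled, crossing a binary variable with itself yields a value slightly below 1. That may be why the self-check in the next cell returns 0.99997 rather than exactly 1, although this is a guess given that the helper's code is not available here.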
confusion_matrix = pd.crosstab(pd_loan_train["TARGET"], pd_loan_train["TARGET"])
funciones.cramers_v(confusion_matrix.values)
0.9999726127135284
We compute Cramér's V between the target and each of the categorical variables.
list_var_cat = [var for var in list_var_cat if var != "TARGET"]
# Iterate over the categorical variables
# ==============================================================================
for var in list_var_cat:
    print(f"Variable: {var}")
    # Contingency table of TARGET against the variable (named confusion_matrix here)
    confusion_matrix = pd.crosstab(pd_loan_train["TARGET"], pd_loan_train[var])
    print("Confusion Matrix:")
    print(confusion_matrix)
    # Compute Cramér's V
    cramer_v_value = funciones.cramers_v(confusion_matrix.values)
    print(f"Cramér's V: {cramer_v_value}")
    print("-" * 50)
Variable: NAME_CONTRACT_TYPE Confusion Matrix: NAME_CONTRACT_TYPE Cash loans Revolving loans TARGET 0 204019 22129 1 18574 1286 Cramér's V: 0.03063340425944581 -------------------------------------------------- Variable: CODE_GENDER Confusion Matrix: CODE_GENDER F M XNA TARGET 0 150575 75569 4 1 11357 8503 0 Cramér's V: 0.05391925627367946 -------------------------------------------------- Variable: FLAG_OWN_CAR Confusion Matrix: FLAG_OWN_CAR N Y TARGET 0 148567 77581 1 13809 6051 Cramér's V: 0.021959051333866695 -------------------------------------------------- Variable: FLAG_OWN_REALTY Confusion Matrix: FLAG_OWN_REALTY N Y TARGET 0 69161 156987 1 6294 13566 Cramér's V: 0.006220696776432675 -------------------------------------------------- Variable: CNT_CHILDREN Confusion Matrix: CNT_CHILDREN 0 1 2 3 4 5 6 7 8 9 10 11 12 \ TARGET 0 159151 44392 19518 2692 307 62 11 6 1 0 1 0 2 1 13263 4371 1871 295 48 4 5 0 0 2 0 1 0 CNT_CHILDREN 14 19 TARGET 0 3 2 1 0 0 Cramér's V: 0.025469551087310804 -------------------------------------------------- Variable: NAME_TYPE_SUITE Confusion Matrix: NAME_TYPE_SUITE Children Family Group of people Other_A Other_B \ TARGET 0 2430 29773 197 633 1281 1 200 2401 16 60 144 NAME_TYPE_SUITE Spouse, partner Unaccompanied TARGET 0 8378 182458 1 710 16274 Cramér's V: 0.009818379355878933 -------------------------------------------------- Variable: NAME_INCOME_TYPE Confusion Matrix: NAME_INCOME_TYPE Businessman Commercial associate Maternity leave \ TARGET 0 7 53123 3 1 0 4277 2 NAME_INCOME_TYPE Pensioner State servant Student Unemployed Working TARGET 0 41876 16345 15 13 114766 1 2372 987 0 7 12215 Cramér's V: 0.06474470459848282 -------------------------------------------------- Variable: NAME_EDUCATION_TYPE Confusion Matrix: NAME_EDUCATION_TYPE Academic degree Higher education Incomplete higher \ TARGET 0 129 56749 7487 1 3 3208 706 NAME_EDUCATION_TYPE Lower secondary Secondary / secondary special TARGET 0 2727 159056 1 324 15619 Cramér's 
V: 0.05732372173178581 -------------------------------------------------- Variable: NAME_FAMILY_STATUS Confusion Matrix: NAME_FAMILY_STATUS Civil marriage Married Separated Single / not married \ TARGET 0 21432 145287 14514 32812 1 2380 11911 1287 3530 NAME_FAMILY_STATUS Unknown Widow TARGET 0 1 12102 1 0 752 Cramér's V: 0.03946471034470812 -------------------------------------------------- Variable: NAME_HOUSING_TYPE Confusion Matrix: NAME_HOUSING_TYPE Co-op apartment House / apartment Municipal apartment \ TARGET 0 820 201225 8195 1 76 16996 772 NAME_HOUSING_TYPE Office apartment Rented apartment With parents TARGET 0 1935 3439 10534 1 136 481 1399 Cramér's V: 0.03696771692760967 -------------------------------------------------- Variable: FLAG_MOBIL Confusion Matrix: FLAG_MOBIL 0 1 TARGET 0 1 226147 1 0 19860 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_EMP_PHONE Confusion Matrix: FLAG_EMP_PHONE 0 1 TARGET 0 41889 184259 1 2380 17480 Cramér's V: 0.046308016994052285 -------------------------------------------------- Variable: FLAG_WORK_PHONE Confusion Matrix: FLAG_WORK_PHONE 0 1 TARGET 0 181867 44281 1 15123 4737 Cramér's V: 0.029042708422753204 -------------------------------------------------- Variable: FLAG_CONT_MOBILE Confusion Matrix: FLAG_CONT_MOBILE 0 1 TARGET 0 411 225737 1 34 19826 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_PHONE Confusion Matrix: FLAG_PHONE 0 1 TARGET 0 161777 64371 1 14943 4917 Cramér's V: 0.022336405037527752 -------------------------------------------------- Variable: FLAG_EMAIL Confusion Matrix: FLAG_EMAIL 0 1 TARGET 0 213256 12892 1 18756 1104 Cramér's V: 0.0 -------------------------------------------------- Variable: OCCUPATION_TYPE Confusion Matrix: OCCUPATION_TYPE Accountants Cleaning staff Cooking staff Core staff \ TARGET 0 7508 3388 4282 20566 1 372 356 501 1389 OCCUPATION_TYPE Drivers HR staff High skill tech staff IT staff Laborers \ TARGET 0 
13198 432 8523 378 39475 1 1682 26 573 24 4676 OCCUPATION_TYPE Low-skill Laborers Managers Medicine staff \ TARGET 0 1376 16097 6367 1 292 1074 445 OCCUPATION_TYPE Private service staff Realty agents Sales staff \ TARGET 0 1953 545 23217 1 146 48 2507 OCCUPATION_TYPE Secretaries Security staff Waiters/barmen staff TARGET 0 956 4790 956 1 70 559 126 Cramér's V: 0.08134143637324827 -------------------------------------------------- Variable: REGION_RATING_CLIENT Confusion Matrix: REGION_RATING_CLIENT 1 2 3 TARGET 0 24501 167226 34421 1 1274 14269 4317 Cramér's V: 0.05848726109560156 -------------------------------------------------- Variable: REGION_RATING_CLIENT_W_CITY Confusion Matrix: REGION_RATING_CLIENT_W_CITY 1 2 3 TARGET 0 25982 169037 31129 1 1360 14485 4015 Cramér's V: 0.06020767001429367 -------------------------------------------------- Variable: WEEKDAY_APPR_PROCESS_START Confusion Matrix: WEEKDAY_APPR_PROCESS_START FRIDAY MONDAY SATURDAY SUNDAY THURSDAY \ TARGET 0 36996 37321 24923 11869 37249 1 3263 3197 2135 1029 3265 WEEKDAY_APPR_PROCESS_START TUESDAY WEDNESDAY TARGET 0 39682 38108 1 3567 3404 Cramér's V: 0.0 -------------------------------------------------- Variable: REG_REGION_NOT_LIVE_REGION Confusion Matrix: REG_REGION_NOT_LIVE_REGION 0 1 TARGET 0 222738 3410 1 19519 341 Cramér's V: 0.004122438256299429 -------------------------------------------------- Variable: REG_REGION_NOT_WORK_REGION Confusion Matrix: REG_REGION_NOT_WORK_REGION 0 1 TARGET 0 214770 11378 1 18736 1124 Cramér's V: 0.007493967694399167 -------------------------------------------------- Variable: LIVE_REGION_NOT_WORK_REGION Confusion Matrix: LIVE_REGION_NOT_WORK_REGION 0 1 TARGET 0 217013 9135 1 19000 860 Cramér's V: 0.003427429834172142 -------------------------------------------------- Variable: REG_CITY_NOT_LIVE_CITY Confusion Matrix: REG_CITY_NOT_LIVE_CITY 0 1 TARGET 0 209221 16927 1 17498 2362 Cramér's V: 0.044601466415169586 
-------------------------------------------------- Variable: REG_CITY_NOT_WORK_CITY Confusion Matrix: REG_CITY_NOT_WORK_CITY 0 1 TARGET 0 175453 50695 1 13873 5987 Cramér's V: 0.04994489201325126 -------------------------------------------------- Variable: LIVE_CITY_NOT_WORK_CITY Confusion Matrix: LIVE_CITY_NOT_WORK_CITY 0 1 TARGET 0 186379 39769 1 15481 4379 Cramér's V: 0.031606622004441996 -------------------------------------------------- Variable: ORGANIZATION_TYPE Confusion Matrix: ORGANIZATION_TYPE Advertising Agriculture Bank Business Entity Type 1 \ TARGET 0 320 1767 1926 4399 1 29 214 104 405 ORGANIZATION_TYPE Business Entity Type 2 Business Entity Type 3 Cleaning \ TARGET 0 7752 49319 185 1 731 5017 22 ORGANIZATION_TYPE Construction Culture Electricity Emergency Government \ TARGET 0 4762 285 707 414 7723 1 624 15 52 33 588 ORGANIZATION_TYPE Hotel Housing Industry: type 1 Industry: type 10 \ TARGET 0 694 2166 731 84 1 42 185 93 6 ORGANIZATION_TYPE Industry: type 11 Industry: type 12 Industry: type 13 \ TARGET 0 1989 276 51 1 195 13 8 ORGANIZATION_TYPE Industry: type 2 Industry: type 3 Industry: type 4 \ TARGET 0 347 2369 624 1 26 270 64 ORGANIZATION_TYPE Industry: type 5 Industry: type 6 Industry: type 7 \ TARGET 0 450 87 963 1 32 5 81 ORGANIZATION_TYPE Industry: type 8 Industry: type 9 Insurance \ TARGET 0 17 2460 450 1 2 176 28 ORGANIZATION_TYPE Kindergarten Legal Services Medicine Military Mobile \ TARGET 0 5126 223 8325 1996 236 1 389 19 567 104 24 ORGANIZATION_TYPE Other Police Postal Realtor Religion Restaurant \ TARGET 0 12332 1769 1577 297 66 1285 1 1014 90 149 37 4 167 ORGANIZATION_TYPE School Security Security Ministries Self-employed \ TARGET 0 6720 2350 1494 27568 1 412 261 70 3182 ORGANIZATION_TYPE Services Telecom Trade: type 1 Trade: type 2 \ TARGET 0 1183 418 252 1408 1 84 33 25 108 ORGANIZATION_TYPE Trade: type 3 Trade: type 4 Trade: type 5 Trade: type 6 \ TARGET 0 2491 53 38 491 1 280 2 3 26 ORGANIZATION_TYPE Trade: type 7 Transport: 
type 1 Transport: type 2 \ TARGET 0 5739 154 1622 1 603 8 145 ORGANIZATION_TYPE Transport: type 3 Transport: type 4 University XNA TARGET 0 798 3947 992 41881 1 149 410 56 2379 Cramér's V: 0.07182650528247596 -------------------------------------------------- Variable: FONDKAPREMONT_MODE Confusion Matrix: FONDKAPREMONT_MODE not specified org spec account reg oper account \ TARGET 0 4220 4221 55025 1 363 247 4143 FONDKAPREMONT_MODE reg oper spec account TARGET 0 9061 1 629 Cramér's V: 0.01640476811969389 -------------------------------------------------- Variable: HOUSETYPE_MODE Confusion Matrix: HOUSETYPE_MODE block of flats specific housing terraced house TARGET 0 112109 1064 885 1 8384 125 84 Cramér's V: 0.014271912094046454 -------------------------------------------------- Variable: WALLSMATERIAL_MODE Confusion Matrix: WALLSMATERIAL_MODE Block Mixed Monolithic Others Panel Stone, brick \ TARGET 0 6848 1667 1356 1183 49649 47962 1 508 147 68 110 3384 3831 WALLSMATERIAL_MODE Wooden TARGET 0 3891 1 419 Cramér's V: 0.029548956479788134 -------------------------------------------------- Variable: EMERGENCYSTATE_MODE Confusion Matrix: EMERGENCYSTATE_MODE No Yes TARGET 0 118735 1689 1 8885 184 Cramér's V: 0.012967084263474872 -------------------------------------------------- Variable: FLAG_DOCUMENT_2 Confusion Matrix: FLAG_DOCUMENT_2 0 1 TARGET 0 226139 9 1 19856 4 Cramér's V: 0.004608500646534983 -------------------------------------------------- Variable: FLAG_DOCUMENT_3 Confusion Matrix: FLAG_DOCUMENT_3 0 1 TARGET 0 66892 159256 1 4400 15460 Cramér's V: 0.04451636457882146 -------------------------------------------------- Variable: FLAG_DOCUMENT_4 Confusion Matrix: FLAG_DOCUMENT_4 0 1 TARGET 0 226129 19 1 19860 0 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_5 Confusion Matrix: FLAG_DOCUMENT_5 0 1 TARGET 0 222789 3359 1 19557 303 Cramér's V: 0.0 -------------------------------------------------- Variable: 
FLAG_DOCUMENT_6 Confusion Matrix: FLAG_DOCUMENT_6 0 1 TARGET 0 205677 20471 1 18676 1184 Cramér's V: 0.029617893391359174 -------------------------------------------------- Variable: FLAG_DOCUMENT_7 Confusion Matrix: FLAG_DOCUMENT_7 0 1 TARGET 0 226105 43 1 19857 3 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_8 Confusion Matrix: FLAG_DOCUMENT_8 0 1 TARGET 0 207567 18581 1 18393 1467 Cramér's V: 0.007982449542583153 -------------------------------------------------- Variable: FLAG_DOCUMENT_9 Confusion Matrix: FLAG_DOCUMENT_9 0 1 TARGET 0 225251 897 1 19803 57 Cramér's V: 0.004229341123089524 -------------------------------------------------- Variable: FLAG_DOCUMENT_10 Confusion Matrix: FLAG_DOCUMENT_10 0 1 TARGET 0 226142 6 1 19860 0 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_11 Confusion Matrix: FLAG_DOCUMENT_11 0 1 TARGET 0 225241 907 1 19798 62 Cramér's V: 0.0031576417967910148 -------------------------------------------------- Variable: FLAG_DOCUMENT_12 Confusion Matrix: FLAG_DOCUMENT_12 0 1 TARGET 0 226146 2 1 19860 0 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_13 Confusion Matrix: FLAG_DOCUMENT_13 0 1 TARGET 0 225315 833 1 19833 27 Cramér's V: 0.01040621134565869 -------------------------------------------------- Variable: FLAG_DOCUMENT_14 Confusion Matrix: FLAG_DOCUMENT_14 0 1 TARGET 0 225421 727 1 19837 23 Cramér's V: 0.009822249260921416 -------------------------------------------------- Variable: FLAG_DOCUMENT_15 Confusion Matrix: FLAG_DOCUMENT_15 0 1 TARGET 0 225854 294 1 19851 9 Cramér's V: 0.006037171225511335 -------------------------------------------------- Variable: FLAG_DOCUMENT_16 Confusion Matrix: FLAG_DOCUMENT_16 0 1 TARGET 0 223802 2346 1 19749 111 Cramér's V: 0.012876070526313354 -------------------------------------------------- Variable: FLAG_DOCUMENT_17 Confusion Matrix: FLAG_DOCUMENT_17 0 1 TARGET 0 
226082 66 1 19858 2 Cramér's V: 0.0017709715553645403 -------------------------------------------------- Variable: FLAG_DOCUMENT_18 Confusion Matrix: FLAG_DOCUMENT_18 0 1 TARGET 0 224251 1897 1 19748 112 Cramér's V: 0.007987104442843833 -------------------------------------------------- Variable: FLAG_DOCUMENT_19 Confusion Matrix: FLAG_DOCUMENT_19 0 1 TARGET 0 226010 138 1 19852 8 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_20 Confusion Matrix: FLAG_DOCUMENT_20 0 1 TARGET 0 226030 118 1 19849 11 Cramér's V: 0.0 -------------------------------------------------- Variable: FLAG_DOCUMENT_21 Confusion Matrix: FLAG_DOCUMENT_21 0 1 TARGET 0 226073 75 1 19853 7 Cramér's V: 0.0 --------------------------------------------------
The variables analyzed show weak associations with the target variable (TARGET): most Cramér's V values are below 0.1, which indicates that no single categorical variable is strongly related to payment default on its own. This suggests that other variables or approaches may be needed to improve the predictive model.
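As a reference for how such a coefficient can be obtained from a contingency table, the following is a minimal sketch of a bias-corrected Cramér's V. It is not necessarily the implementation used in this notebook: the function name `cramers_v_corrected`, the toy data, and the choice of `correction=False` (no Yates continuity correction) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_corrected(table: pd.DataFrame) -> float:
    """Bias-corrected Cramér's V for a contingency table of counts."""
    observed = table.to_numpy()
    # Plain chi-square statistic; correction=False disables the Yates
    # continuity correction that SciPy applies to 2x2 tables by default.
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    r, k = observed.shape
    phi2 = chi2 / n
    # Subtract the expected value of phi2 under independence, clipped at zero.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1)))


# Hypothetical toy data, not the loan dataset.
toy = pd.DataFrame({
    "TARGET": [0] * 50 + [1] * 50,
    "perfect": ["a"] * 50 + ["b"] * 50,   # fully determined by TARGET
    "unrelated": ["a", "b"] * 50,         # same split in both classes
})
v_perfect = cramers_v_corrected(pd.crosstab(toy["TARGET"], toy["perfect"]))
v_unrelated = cramers_v_corrected(pd.crosstab(toy["TARGET"], toy["unrelated"]))
print(v_perfect, v_unrelated)
```

Because the corrected statistic is clipped at zero, very weak associations can come out as exactly 0.0 rather than as a tiny positive number.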
In addition to Cramér's V, I analyzed the Weight of Evidence (WOE) and the Information Value (IV) of the categorical variables. WOE measures the strength of the relationship between each category and the target variable, turning categories into continuous values that reflect the risk of the event. IV, in turn, quantifies the predictive power of a categorical variable as a whole, indicating how well it separates the classes of the target.
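The two measures can be sketched as follows. This is an illustration under stated assumptions, not the code of `funciones.calculate_woe` or `funciones.calculate_iv`, which is not shown here: it assumes the convention WOE = ln(share of non-defaulters / share of defaulters), which matches the sign of the values reported in this notebook, and the helper name `woe_iv_table` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def woe_iv_table(df: pd.DataFrame, target: str, feature: str) -> tuple[pd.DataFrame, float]:
    """Return the per-category WOE table and the total IV of one feature."""
    counts = pd.crosstab(df[feature], df[target])
    # Share of all non-defaulters (0) and all defaulters (1) in each category.
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log(dist_good / dist_bad)
    iv = float(((dist_good - dist_bad) * woe).sum())
    table = pd.DataFrame({"dist_good": dist_good, "dist_bad": dist_bad, "WOE": woe})
    return table, iv


# Hypothetical toy data: segment A has 60 good / 10 bad, segment B 20 good / 10 bad.
toy = pd.DataFrame({
    "TARGET": [0] * 60 + [1] * 10 + [0] * 20 + [1] * 10,
    "segment": ["A"] * 70 + ["B"] * 30,
})
table, iv = woe_iv_table(toy, "TARGET", "segment")
print(table)
print(f"IV: {iv:.4f}")
```

With this convention a positive WOE means the category is over-represented among clients without payment difficulties, and a negative WOE means it is over-represented among clients with them.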
# Iterate over the categorical variables
# ==============================================================================
for var in list_var_cat:
    print(f"Variable: {var}")
    # Compute the per-category WOE table (it is printed under the "IV:" label)
    woe_value = funciones.calculate_woe(pd_loan_train, 'TARGET', var)
    print(f"IV: {woe_value}")
    print("-" * 50)
Variable: NAME_CONTRACT_TYPE IV: TARGET WOE NAME_CONTRACT_TYPE Cash loans -0.036032 Revolving loans 0.412870 -------------------------------------------------- Variable: CODE_GENDER IV: TARGET WOE CODE_GENDER F 0.152145 M -0.247855 XNA inf -------------------------------------------------- Variable: FLAG_OWN_CAR IV: TARGET WOE FLAG_OWN_CAR N -0.056767 Y 0.118617 -------------------------------------------------- Variable: FLAG_OWN_REALTY IV: TARGET WOE FLAG_OWN_REALTY N -0.035642 Y 0.016114 -------------------------------------------------- Variable: CNT_CHILDREN IV: TARGET WOE CNT_CHILDREN 0 0.052393 1 -0.114415 2 -0.087618 3 -0.221418 4 -0.576835 5 0.308358 6 -1.644025 7 inf 8 inf 9 -inf 10 inf 11 -inf 12 inf 14 inf 19 inf -------------------------------------------------- Variable: NAME_TYPE_SUITE IV: TARGET WOE NAME_TYPE_SUITE Children 0.066497 Family 0.086884 Group of people 0.079783 Other_A -0.074707 Other_B -0.245249 Spouse, partner 0.037267 Unaccompanied -0.013881 -------------------------------------------------- Variable: NAME_INCOME_TYPE IV: TARGET WOE NAME_INCOME_TYPE Businessman inf Commercial associate 0.086876 Maternity leave -2.027017 Pensioner 0.438497 State servant 0.374525 Student inf Unemployed -1.813443 Working -0.192251 -------------------------------------------------- Variable: NAME_EDUCATION_TYPE IV: TARGET WOE NAME_EDUCATION_TYPE Academic degree 1.328718 Higher education 0.440508 Incomplete higher -0.071174 Lower secondary -0.302268 Secondary / secondary special -0.111714 -------------------------------------------------- Variable: NAME_FAMILY_STATUS IV: TARGET WOE NAME_FAMILY_STATUS Civil marriage -0.234697 Married 0.068767 Separated -0.009682 Single / not married -0.202986 Unknown inf Widow 0.345908 -------------------------------------------------- Variable: NAME_HOUSING_TYPE IV: TARGET WOE NAME_HOUSING_TYPE Co-op apartment -0.053911 House / apartment 0.038964 Municipal apartment -0.070187 Office apartment 0.222726 Rented apartment 
-0.465413 With parents -0.413632 -------------------------------------------------- Variable: FLAG_MOBIL IV: TARGET WOE FLAG_MOBIL 0 inf 1 -0.000004 -------------------------------------------------- Variable: FLAG_EMP_PHONE IV: TARGET WOE FLAG_EMP_PHONE 0 0.435441 1 -0.077197 -------------------------------------------------- Variable: FLAG_WORK_PHONE IV: TARGET WOE FLAG_WORK_PHONE 0 0.054577 1 -0.197330 -------------------------------------------------- Variable: FLAG_CONT_MOBILE IV: TARGET WOE FLAG_CONT_MOBILE 0 0.059751 1 -0.000106 -------------------------------------------------- Variable: FLAG_PHONE IV: TARGET WOE FLAG_PHONE 0 -0.050506 1 0.139483 -------------------------------------------------- Variable: FLAG_EMAIL IV: TARGET WOE FLAG_EMAIL 0 -0.001502 1 0.025185 -------------------------------------------------- Variable: OCCUPATION_TYPE IV: TARGET WOE OCCUPATION_TYPE Accountants 0.666909 Cleaning staff -0.084857 Cooking staff -0.192352 Core staff 0.357134 Drivers -0.277840 HR staff 0.472408 High skill tech staff 0.361717 IT staff 0.418919 Laborers -0.204697 Low-skill Laborers -0.787739 Managers 0.369322 Medicine staff 0.322888 Private service staff 0.255594 Realty agents 0.091663 Sales staff -0.112123 Secretaries 0.276341 Security staff -0.189785 Waiters/barmen staff -0.311445 -------------------------------------------------- Variable: REGION_RATING_CLIENT IV: TARGET WOE REGION_RATING_CLIENT 1 0.524070 2 0.028775 3 -0.356376 -------------------------------------------------- Variable: REGION_RATING_CLIENT_W_CITY IV: TARGET WOE REGION_RATING_CLIENT_W_CITY 1 0.517437 2 0.024522 3 -0.384379 -------------------------------------------------- Variable: WEEKDAY_APPR_PROCESS_START IV: TARGET WOE WEEKDAY_APPR_PROCESS_START FRIDAY -0.004319 MONDAY 0.024861 SATURDAY 0.024842 SUNDAY 0.012861 THURSDAY 0.001883 TUESDAY -0.023309 WEDNESDAY -0.017009 -------------------------------------------------- Variable: REG_REGION_NOT_LIVE_REGION IV: TARGET WOE 
REG_REGION_NOT_LIVE_REGION 0 0.002126 1 -0.129897 -------------------------------------------------- Variable: REG_REGION_NOT_WORK_REGION IV: TARGET WOE REG_REGION_NOT_WORK_REGION 0 0.006639 1 -0.117694 -------------------------------------------------- Variable: LIVE_REGION_NOT_WORK_REGION IV: TARGET WOE LIVE_REGION_NOT_WORK_REGION 0 0.003036 1 -0.069546 -------------------------------------------------- Variable: REG_CITY_NOT_LIVE_CITY IV: TARGET WOE REG_CITY_NOT_LIVE_CITY 0 0.048823 1 -0.463081 -------------------------------------------------- Variable: REG_CITY_NOT_WORK_CITY IV: TARGET WOE REG_CITY_NOT_WORK_CITY 0 0.104945 1 -0.296245 -------------------------------------------------- Variable: LIVE_CITY_NOT_WORK_CITY IV: TARGET WOE LIVE_CITY_NOT_WORK_CITY 0 0.055687 1 -0.226215 -------------------------------------------------- Variable: ORGANIZATION_TYPE IV: TARGET WOE ORGANIZATION_TYPE Advertising -0.031457 Agriculture -0.321420 Bank 0.486328 Business Entity Type 1 -0.047237 Business Entity Type 2 -0.071189 Business Entity Type 3 -0.147005 Cleaning -0.303169 Construction -0.400209 Culture 0.511957 Electricity 0.177305 Emergency 0.096876 Government 0.142749 Hotel 0.372320 Housing 0.027800 Industry: type 1 -0.370668 Industry: type 10 0.206575 Industry: type 11 -0.110094 Industry: type 12 0.622970 Industry: type 13 -0.580098 Industry: type 2 0.158746 Industry: type 3 -0.260681 Industry: type 4 -0.155215 Industry: type 5 0.211030 Industry: type 6 0.423988 Industry: type 7 0.043122 Industry: type 8 -0.292416 Industry: type 9 0.204951 Insurance 0.344561 Kindergarten 0.146020 Legal Services 0.030251 Medicine 0.254177 Military 0.522028 Mobile -0.146704 Other 0.065813 Police 0.545878 Postal -0.073149 Realtor -0.349668 Religion 0.370878 Restaurant -0.391962 School 0.359338 Security -0.234832 Security Ministries 0.628235 Self-employed -0.273336 Services 0.212510 Telecom 0.106492 Trade: type 1 -0.121929 Trade: type 2 0.135312 Trade: type 3 -0.246832 Trade: type 4 
0.844663 Trade: type 5 0.106492 Trade: type 6 0.505866 Trade: type 7 -0.179359 Transport: type 1 0.525029 Transport: type 2 -0.017801 Transport: type 3 -0.754320 Transport: type 4 -0.167928 University 0.441889 XNA 0.435670 -------------------------------------------------- Variable: FONDKAPREMONT_MODE IV: TARGET WOE FONDKAPREMONT_MODE not specified -0.147711 org spec account 0.237540 reg oper account -0.014531 reg oper spec account 0.066705 -------------------------------------------------- Variable: HOUSETYPE_MODE IV: TARGET WOE HOUSETYPE_MODE block of flats 0.007387 specific housing -0.444282 terraced house -0.230988 -------------------------------------------------- Variable: WALLSMATERIAL_MODE IV: TARGET WOE WALLSMATERIAL_MODE Block 0.013956 Mixed -0.158926 Monolithic 0.405512 Others -0.211946 Panel 0.098645 Stone, brick -0.059991 Wooden -0.358724 -------------------------------------------------- Variable: EMERGENCYSTATE_MODE IV: TARGET WOE EMERGENCYSTATE_MODE No 0.006373 Yes -0.369201 -------------------------------------------------- Variable: FLAG_DOCUMENT_2 IV: TARGET WOE FLAG_DOCUMENT_2 0 0.000162 1 -1.621552 -------------------------------------------------- Variable: FLAG_DOCUMENT_3 IV: TARGET WOE FLAG_DOCUMENT_3 0 0.288993 1 -0.100225 -------------------------------------------------- Variable: FLAG_DOCUMENT_4 IV: TARGET WOE FLAG_DOCUMENT_4 0 -0.000084 1 inf -------------------------------------------------- Variable: FLAG_DOCUMENT_5 IV: TARGET WOE FLAG_DOCUMENT_5 0 0.000410 1 -0.026816 -------------------------------------------------- Variable: FLAG_DOCUMENT_6 IV: TARGET WOE FLAG_DOCUMENT_6 0 -0.033414 1 0.417629 -------------------------------------------------- Variable: FLAG_DOCUMENT_7 IV: TARGET WOE FLAG_DOCUMENT_7 0 -0.000039 1 0.230106 -------------------------------------------------- Variable: FLAG_DOCUMENT_8 IV: TARGET WOE FLAG_DOCUMENT_8 0 -0.008998 1 0.106438 -------------------------------------------------- Variable: FLAG_DOCUMENT_9 IV: 
TARGET WOE FLAG_DOCUMENT_9 0 -0.001100 1 0.323523 -------------------------------------------------- Variable: FLAG_DOCUMENT_10 IV: TARGET WOE FLAG_DOCUMENT_10 0 -0.000027 1 inf -------------------------------------------------- Variable: FLAG_DOCUMENT_11 IV: TARGET WOE FLAG_DOCUMENT_11 0 -0.000892 1 0.250526 -------------------------------------------------- Variable: FLAG_DOCUMENT_12 IV: TARGET WOE FLAG_DOCUMENT_12 0 -0.000009 1 inf -------------------------------------------------- Variable: FLAG_DOCUMENT_13 IV: TARGET WOE FLAG_DOCUMENT_13 0 -0.002330 1 0.996715 -------------------------------------------------- Variable: FLAG_DOCUMENT_14 IV: TARGET WOE FLAG_DOCUMENT_14 0 -0.002061 1 1.020950 -------------------------------------------------- Variable: FLAG_DOCUMENT_15 IV: TARGET WOE FLAG_DOCUMENT_15 0 -0.000848 1 1.053873 -------------------------------------------------- Variable: FLAG_DOCUMENT_16 IV: TARGET WOE FLAG_DOCUMENT_16 0 -0.004823 1 0.618455 -------------------------------------------------- Variable: FLAG_DOCUMENT_17 IV: TARGET WOE FLAG_DOCUMENT_17 0 -0.000191 1 1.064026 -------------------------------------------------- Variable: FLAG_DOCUMENT_18 IV: TARGET WOE FLAG_DOCUMENT_18 0 -0.002768 1 0.397048 -------------------------------------------------- Variable: FLAG_DOCUMENT_19 IV: TARGET WOE FLAG_DOCUMENT_19 0 -0.000208 1 0.415330 -------------------------------------------------- Variable: FLAG_DOCUMENT_20 IV: TARGET WOE FLAG_DOCUMENT_20 0 0.000032 1 -0.059693 -------------------------------------------------- Variable: FLAG_DOCUMENT_21 IV: TARGET WOE FLAG_DOCUMENT_21 0 0.000021 1 -0.060904 --------------------------------------------------
Given that the target variable indicates whether the client has had payment difficulties (1) or not (0), the WOE values show how each category of a variable shifts the probability that a client has payment problems.
For example, for the client's education level (NAME_EDUCATION_TYPE), the WOE of clients with an academic degree (1.33) suggests that they are more likely not to have payment difficulties. In contrast, clients with lower secondary education have a negative WOE (-0.30), which indicates that they are more likely to experience payment difficulties.
As for the contract type, revolving loans have a positive WOE (0.41), which suggests that clients with this type of loan are less likely to have payment problems. Cash loans, on the other hand, have a WOE close to zero (-0.04), so that category has only a weak relationship with the probability of default.
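As mentioned earlier, income type is a key variable. Businessmen and students have an infinite WOE, which arises because no client in those categories shows payment difficulties in the training set. That points to low default risk, but an infinite value is a symptom of an empty cell, typically in a category with very few clients, rather than a reliable measure of strength. In contrast, unemployed clients (-1.81) and clients on maternity leave (-2.03) have strongly negative WOE values, which reflects a high risk of payment difficulties.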
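An infinite WOE appears whenever a category has zero clients in one of the two classes, because the logarithm is then taken of a ratio with a zero in it. The sketch below reproduces that effect on hypothetical counts and shows one common workaround, adding a small constant to every cell before computing the shares. The counts, the helper name `woe_from_counts`, and the value 0.5 are assumptions for illustration; merging rare categories is another option.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: the "rare" category has no defaulters at all.
counts = pd.DataFrame(
    {"good": [900, 95, 5], "bad": [80, 20, 0]},
    index=["common", "medium", "rare"],
)


def woe_from_counts(counts: pd.DataFrame, smoothing: float = 0.0) -> pd.Series:
    """WOE = ln(share of good / share of bad), with optional additive smoothing."""
    good = counts["good"] + smoothing
    bad = counts["bad"] + smoothing
    # Silence the divide-by-zero warning so the raw case can be shown as-is.
    with np.errstate(divide="ignore"):
        return np.log((good / good.sum()) / (bad / bad.sum()))


raw_woe = woe_from_counts(counts)                      # "rare" comes out as inf
smoothed_woe = woe_from_counts(counts, smoothing=0.5)  # every value is finite
print(pd.DataFrame({"raw": raw_woe, "smoothed": smoothed_woe}))
```

The smoothed value for the rare category is still large, but it is finite, so an IV computed from it is finite as well.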
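The WOE of the number of children is also relevant. It is slightly positive for clients with no children (0.05) and negative for clients with 1 to 4 children, reaching -0.58 at 4, which suggests that clients with more children are more likely to face payment difficulties. From 5 children onward the values become erratic (0.31, -1.64, and then positive or negative infinity), so they should not be read as a trend.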
Finally, regarding occupation type, accountants (0.67) and managers (0.37) have positive WOE values, which indicates a lower default risk, whereas low-skill laborers have a markedly negative WOE (-0.79), which suggests a higher probability of payment difficulties.
# Iterate over the categorical variables
for var in list_var_cat:
    print(f"Variable: {var}")
    # Compute the IV
    iv_value = funciones.calculate_iv(pd_loan_train, 'TARGET', var)
    print(f"IV: {iv_value}")
    print("-" * 50)
Variable: NAME_CONTRACT_TYPE IV: 0.014858015991943418 -------------------------------------------------- Variable: CODE_GENDER IV: inf -------------------------------------------------- Variable: FLAG_OWN_CAR IV: 0.006729698450707212 -------------------------------------------------- Variable: FLAG_OWN_REALTY IV: 0.0005743146545654015 -------------------------------------------------- Variable: CNT_CHILDREN IV: inf -------------------------------------------------- Variable: NAME_TYPE_SUITE IV: 0.0016197070389929794 -------------------------------------------------- Variable: NAME_INCOME_TYPE IV: inf -------------------------------------------------- Variable: NAME_EDUCATION_TYPE IV: 0.050688378574963514 -------------------------------------------------- Variable: NAME_FAMILY_STATUS IV: inf -------------------------------------------------- Variable: NAME_HOUSING_TYPE IV: 0.015966243440681843 -------------------------------------------------- Variable: FLAG_MOBIL IV: inf -------------------------------------------------- Variable: FLAG_EMP_PHONE IV: 0.03352106867234879 -------------------------------------------------- Variable: FLAG_WORK_PHONE IV: 0.010760024355650568 -------------------------------------------------- Variable: FLAG_CONT_MOBILE IV: 6.3094537154537185e-06 -------------------------------------------------- Variable: FLAG_PHONE IV: 0.0070405936562062885 -------------------------------------------------- Variable: FLAG_EMAIL IV: 3.7837174519073465e-05 -------------------------------------------------- Variable: OCCUPATION_TYPE IV: 0.08707479245770457 -------------------------------------------------- Variable: REGION_RATING_CLIENT IV: 0.046986577089649706 -------------------------------------------------- Variable: REGION_RATING_CLIENT_W_CITY IV: 0.04925703840302959 -------------------------------------------------- Variable: WEEKDAY_APPR_PROCESS_START IV: 0.0003258102490672144 -------------------------------------------------- Variable: 
REG_REGION_NOT_LIVE_REGION IV: 0.00027613485680701556 -------------------------------------------------- Variable: REG_REGION_NOT_WORK_REGION IV: 0.0007813069024103844 -------------------------------------------------- Variable: LIVE_REGION_NOT_WORK_REGION IV: 0.0002111576276634438 -------------------------------------------------- Variable: REG_CITY_NOT_LIVE_CITY IV: 0.02256639203089647 -------------------------------------------------- Variable: REG_CITY_NOT_WORK_CITY IV: 0.031009112627587307 -------------------------------------------------- Variable: LIVE_CITY_NOT_WORK_CITY IV: 0.012583964428070134 -------------------------------------------------- Variable: ORGANIZATION_TYPE IV: 0.07578411185209898 -------------------------------------------------- Variable: FONDKAPREMONT_MODE IV: 0.004990206081286682 -------------------------------------------------- Variable: HOUSETYPE_MODE IV: 0.00283748372836535 -------------------------------------------------- Variable: WALLSMATERIAL_MODE IV: 0.013590653423688745 -------------------------------------------------- Variable: EMERGENCYSTATE_MODE IV: 0.00235238688232154 -------------------------------------------------- Variable: FLAG_DOCUMENT_2 IV: 0.0002620898427225545 -------------------------------------------------- Variable: FLAG_DOCUMENT_3 IV: 0.02889465905744434 -------------------------------------------------- Variable: FLAG_DOCUMENT_4 IV: inf -------------------------------------------------- Variable: FLAG_DOCUMENT_5 IV: 1.0990962270207618e-05 -------------------------------------------------- Variable: FLAG_DOCUMENT_6 IV: 0.013938602985122079 -------------------------------------------------- Variable: FLAG_DOCUMENT_7 IV: 8.994884732657452e-06 -------------------------------------------------- Variable: FLAG_DOCUMENT_8 IV: 0.0009576502613128974 -------------------------------------------------- Variable: FLAG_DOCUMENT_9 IV: 0.0003558963407377865 -------------------------------------------------- Variable: 
FLAG_DOCUMENT_10 IV: inf -------------------------------------------------- Variable: FLAG_DOCUMENT_11 IV: 0.00022345908427952863 -------------------------------------------------- Variable: FLAG_DOCUMENT_12 IV: inf -------------------------------------------------- Variable: FLAG_DOCUMENT_13 IV: 0.002321691599868672 -------------------------------------------------- Variable: FLAG_DOCUMENT_14 IV: 0.00210392745388146 -------------------------------------------------- Variable: FLAG_DOCUMENT_15 IV: 0.0008932023349686143 -------------------------------------------------- Variable: FLAG_DOCUMENT_16 IV: 0.0029821442220927335 -------------------------------------------------- Variable: FLAG_DOCUMENT_17 IV: 0.0002034136909195555 -------------------------------------------------- Variable: FLAG_DOCUMENT_18 IV: 0.001099029498291995 -------------------------------------------------- Variable: FLAG_DOCUMENT_19 IV: 8.618255864595225e-05 -------------------------------------------------- Variable: FLAG_DOCUMENT_20 IV: 1.9168626540533475e-06 -------------------------------------------------- Variable: FLAG_DOCUMENT_21 IV: 1.2688239108541118e-06 --------------------------------------------------
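In summary, the variables with the highest finite IV, such as OCCUPATION_TYPE (0.087), ORGANIZATION_TYPE (0.076) and NAME_EDUCATION_TYPE (0.051), have the most predictive power among the categorical variables and are the most useful for predicting payment difficulties, although values below 0.1 are usually read as weak under the common rule of thumb. Variables with an IV close to zero, such as FLAG_CONT_MOBILE and FLAG_EMAIL, contribute very little information and may not be relevant for the model. An infinite IV (CODE_GENDER, CNT_CHILDREN, NAME_INCOME_TYPE, NAME_FAMILY_STATUS, FLAG_MOBIL, FLAG_DOCUMENT_4, FLAG_DOCUMENT_10 and FLAG_DOCUMENT_12) does not mean high predictive power: it comes from categories with an infinite WOE, so those variables should be reviewed before judging their usefulness.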
In categorical variables, null values are usually replaced by assigning a new category, here "SIN VALOR" ("no value").
pd_loan_train[list_var_cat] = pd_loan_train[list_var_cat].astype("object").fillna("SIN VALOR").astype("category")
pd_loan_test[list_var_cat] = pd_loan_test[list_var_cat].astype("object").fillna("SIN VALOR").astype("category")
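The round trip through `object` is needed because a categorical column cannot be filled with a label that is not yet among its categories. As an alternative sketch, assuming the column already has `category` dtype, the new label can be registered first with `cat.add_categories` so the dtype is never dropped. The series below is a hypothetical example, not a column of the dataset.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
s = pd.Series(["Laborers", None, "Managers"], dtype="category")

# Register the new label as a valid category, then fill the nulls with it.
filled = s.cat.add_categories("SIN VALOR").fillna("SIN VALOR")
print(filled)
```

Both approaches end with the same values; this one works column by column, whereas the `astype("object")` version handles the whole list of columns in a single statement.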
It is worth remembering that the integer columns defined as categorical, including the boolean variables, have no null values. If they had, they should have been handled as part of the treatment of numerical nulls. For boolean variables, one option would be to impute nulls with -1, while the remaining numeric categorical variables would need to be analyzed in more detail. Following the approach used so far, we could impute their nulls with the median.
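The two options mentioned above could look like the sketch below. The column names `FLAG_HYPOTHETICAL` and `CNT_HYPOTHETICAL` and the data are invented for illustration, and computing the median on the training set only and reusing it on the test set is an assumption about how the approach would be applied.

```python
import pandas as pd

# Hypothetical train/test frames with nulls in integer-like categorical columns.
train = pd.DataFrame({"FLAG_HYPOTHETICAL": [1, 0, None, 1],
                      "CNT_HYPOTHETICAL": [0, 2, None, 1]})
test = pd.DataFrame({"FLAG_HYPOTHETICAL": [None, 1],
                     "CNT_HYPOTHETICAL": [None, 3]})

bool_cols = ["FLAG_HYPOTHETICAL"]
count_cols = ["CNT_HYPOTHETICAL"]

# Boolean flags: use -1 as an explicit "missing" marker.
train[bool_cols] = train[bool_cols].fillna(-1)
test[bool_cols] = test[bool_cols].fillna(-1)

# Other numeric categoricals: medians learned on train, reused on test.
medians = train[count_cols].median()
train[count_cols] = train[count_cols].fillna(medians)
test[count_cols] = test[count_cols].fillna(medians)
print(train)
print(test)
```

Learning the medians on the training set alone keeps information from the test set out of the preprocessing.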
We save the DataFrames to preserve this new intermediate state and make it easier to reuse in later stages of the analysis.
# Training data
# ==============================================================================
pd_loan_train.to_csv("../data/interim/train_pd_data_preprocessing_missing_outlier.csv")
# Test data
# ==============================================================================
pd_loan_test.to_csv("../data/interim/test_pd_data_preprocessing_missing_outlier.csv")
print(pd_loan_train.shape, pd_loan_test.shape)
(246008, 121) (61503, 121)
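When these files are loaded again in a later notebook, two details are worth keeping in mind: `to_csv` writes the index as the first column unless `index=False` is passed, and CSV does not store the `category` dtype. The sketch below shows one way to restore both, using an in-memory buffer and a hypothetical two-row frame instead of the real files.

```python
import io

import pandas as pd

# Hypothetical frame standing in for the saved training data.
df = pd.DataFrame(
    {"TARGET": [0, 1], "CODE_GENDER": pd.Categorical(["F", "M"])},
    index=[10, 42],
)

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 restores the index; dtype restores the categorical column.
restored = pd.read_csv(buffer, index_col=0, dtype={"CODE_GENDER": "category"})
print(restored.dtypes)
print(restored.shape)
```

Without `index_col=0`, the saved index would come back as an extra column and the shape would no longer match the one printed above.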